Well, no two groups will ever start *exactly* the same, even with large sample sizes and randomization. The question is whether they start *too* different to compare directly.

But you can also start with two groups that differ substantially at baseline and still get something useful by comparing difference scores. It doesn't tell you what you really want to know, but it's not nothing. Say Group A starts at 40 and Group B at 100, then Group A (Treatment X) drops 3 points and Group B (Treatment Y) drops 65. We can compare the mean difference scores and get a significance test to go along with that huge effect size. We *can't* say Treatment X is less effective in general, because treatment and starting point are confounded (a group starting at 40 may simply have less room to drop). What we *can* say is that Treatment X in a group like A produced less change than Treatment Y in a group like B. From there it takes some non-numerical, contextual interpretation to judge how far that generalizes. It's not perfect, but in some circumstances it's all you can do, and then it comes down to replication to see how reliable and generalizable the results are.
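For anyone who wants to see the arithmetic, here is a rough sketch of that comparison in Python. The scores are made up so that the baselines and mean changes match the 40/100 and 3/65 example, and the function names are just mine. It uses Welch's t (which doesn't assume equal variances) as one reasonable choice of test, and it only computes the statistic and an effect size, not a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical pre/post scores, invented purely for illustration.
group_a_pre = [38, 41, 40, 42, 39, 40]       # Treatment X, baseline mean 40
group_a_post = [36, 37, 38, 38, 37, 36]
group_b_pre = [98, 102, 100, 101, 99, 100]   # Treatment Y, baseline mean 100
group_b_post = [38, 32, 35, 39, 31, 35]


def change_scores(pre, post):
    """Per-participant difference score (post minus pre)."""
    return [after - before for before, after in zip(pre, post)]


def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def cohens_d(x, y):
    """Cohen's d, pooling the two sample variances (assumes equal group sizes)."""
    pooled_sd = sqrt((variance(x) + variance(y)) / 2)
    return (mean(x) - mean(y)) / pooled_sd


change_a = change_scores(group_a_pre, group_a_post)
change_b = change_scores(group_b_pre, group_b_post)

print("mean change, Group A:", mean(change_a))
print("mean change, Group B:", mean(change_b))
print("Welch t on change scores:", round(welch_t(change_a, change_b), 2))
print("Cohen's d on change scores:", round(cohens_d(change_a, change_b), 2))
```

Note what the test is actually run on: the change scores, not the raw post scores. A big t here says the two treatment-plus-group combinations changed by different amounts. It says nothing about how Treatment X would have done in a group that started at 100.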

*shrug*