A fundamental question!

First, sorry for my first-grade English! If you are OK with it, please read on.

Second, sorry that my question is so basic, but sometimes I see very easy questions which are surprisingly and kindly answered by the community, so I'm trying my chance to learn something here.

Can someone tell me what the threshold is for multiple comparison testing? Based on my experience with some statisticians and on reading medical/dental articles, I first thought that to count as a "multiple comparison" case, a setup should follow the pattern of an ANOVA (or its non-parametric alternatives). Later a friend told me that every design in which more than one test is performed is a case of multiple comparison and should be adjusted properly.

I have a basic but important question: is that true? I have read lots of articles in which more than one test (sometimes many more) has been performed without any multiple comparison correction (except those built into the post-hoc tests). Actually, I have seen multiple comparison corrections only in ANOVA-like designs.

Let's assume that running more than one test increases the chance of obtaining a type I error. My fundamental question here is: what must be the context in which the number of tests is counted? An ANOVA? A study? All studies by a researcher? All studies in a day? Or all studies? And how can someone decide what this limit is?

My English and mathematics don't let me give precise scientific explanations, but my common sense still insists that this multiple comparison thing is not fully valid. OK, I have read some basic articles with the convincing message that increasing the number of tests actually increases the number of P values < 0.05 obtained by chance. However, is it "the number of tests in a study" (e.g., Bonferroni's correction)? Or is it "the number of tests in a part of a study"? For example, if we have 10 Friedman tests in a study, should we fix the type I error for each Friedman test separately, or should we do it for all the possible pairwise comparisons together? (The difference could be huge.) Or is it "the number of tests in all studies"? Or in what?!

I.e., perhaps we might consider the whole body of research as one single study attempting to understand the world (the sample is a sophisticated composite of all the small samples). If this is true, then the almost infinite number of statistical tests that have been and are being performed can definitely disrupt everything we are trying to elucidate from all these statistics. I mean, all those P values < 0.05 could actually be the results of millions of tests being done, and thus have happened by chance, so we should work with alphas < 10^-1000, for example, rather than 0.05. My point may seem pointless, but my question is exactly "how can we decide whether it is pointless or not?" I mean, what is the logic behind this multiple-comparison decision?

You might think this is so basic, but I could link to some articles published in accredited journals in which multiple comparison correction has been done only for some specific test within a much larger setup; for example, for the pairwise tests following the only significant Friedman test (alongside 6 non-significant Friedman tests, as well as several other significant and non-significant tests such as Spearman correlations and chi-squares).

Thank you very much for reading, and so very much for, well, discussing/replying.


Super Moderator
Second, sorry that my question is so basic
I don't think this is a basic question at all :)

Let's assume that running more than one test increases the chance of obtaining a type I error. My fundamental question here is: what must be the context in which the number of tests is counted? An ANOVA? A study? All studies by a researcher? All studies in a day? Or all studies?
...or why not all the studies in a particular issue of a particular journal? Or all articles written by members of TalkStats? :p I don't think there's a good answer to this, really. I think the general decision rule applied by many social science researchers is probably "correct for multiple tests when SPSS prompts me to". I think I've heard suggestions that this friendly fellow may be less susceptible to this particular problem, though...
Thanks for more examples :D and the conditional probability hint to search for, and also for the relief that I haven't gone crazy!

I think the general decision rule applied by many social science researchers is probably "correct for multiple tests when SPSS prompts me to"
;) Then it is acceptable to limit this type of correction to ANOVA-like tests only (not all the tests within a study). Also, each correction should then be applied independently to the post hocs of each Kruskal-Wallis/Friedman/etc. [again, such a relief!].
I have compared 4 means (+-SD) with 10 constant values (0 to 9) using one-sample t-tests. The computed P values show a consistent change. No random outliers have appeared. This makes me doubt that multiple comparisons are a real problem. According to the descriptions, if the problem really existed, then out of the 40 P values resulting from my 40 tests, some should fall randomly outside the range of the other calculated Ps. But I see the P values steadily getting larger and larger.

Can anyone kindly explain how and why the problem of multiple comparison didn't happen here?


Fortran must die
I am not sure this is pertinent to what you are asking, but familywise error (and thus the chance of a type 1 error) increases as you use the same data for multiple tests.
What if we calculate only one P value based on some data? For example I have a mean: 15 +- 2.2.... I can compare it with a constant value: 6 and record the P value.

Then compare the same mean value with 100,000 constant values (from -50000 to +49999) and record the 100,000 P values.

According to what I have cited here (first post), the first examination is an example of single comparison, but the second is a multiple comparison.

Then I can fetch the specific P value for the comparison between my mean value and the value 6 from the second examination (the multiple tests). I guess the P values from the first and second examinations would be exactly the same.

The problem of multiple comparisons says it would probably become smaller (type I error) in the second examination.
What if there are multiple comparisons over time?

For example, what if I have some data and run some tests on it, then delete the results, revise my work and run some other tests on the same data, then delete those and run some others? If I repeat this procedure 100 times, is the chance of getting a false positive higher at the 100th test?

Is it a multiple comparison? Common sense tells me it is, since I think there is no difference between 100 tests performed simultaneously on a dataset, or performed one by one on a dataset.


And if there is a higher probability of getting a false positive at the 100th test, why not at the 1st test? Aren't they all involved in a single unit of multiple comparison? If so, how does nature know I am going to test my data 100 times, so that it can give me a higher chance of a false-positive error at the first test too?

I wish there were some good answers.


Fortran must die
As I understand familywise error, the issue is not whether you compare anything over time, but calculating multiple statistics with the same data. But that raises a question I have no answer for. It is not uncommon to generate multiple t tests, F values, etc. in a multiple regression run. But as far as I know you don't apply familywise corrections to deal with that.

In honesty, if you are using ANOVA or t-test I would consider using a post hoc test like Tukey HSD that automatically corrects for familywise error.
Thanks noetsi for kind answers.

But that raises a question I have no answer for. It is not uncommon to generate multiple t tests, F values, etc. in a multiple regression run. But as far as I know you don't apply familywise corrections to deal with that.
Till now I thought the problem of multiple comparison was already corrected by the statistical package when calculating P values in a regression analysis. I didn't know it is compromised too.

As I understand familywise error, the issue is not whether you compare anything over time, but calculating multiple statistics with the same data.
OK. By "over time" I mean "repeating" [the tests on one single dataset], but a repetition that takes some time to finish. Otherwise, all multiple comparisons are performed over time, as our statistical software does the process serially, thus one by one (but in milliseconds instead of days). So what's the difference?

Let's even talk about Bonferroni's correction after a Friedman test. Assume we want to test all the subgroups involved in a Friedman test with Wilcoxon tests, and we have 100 pairwise comparisons. According to the formula, we should adjust the level of significance to 0.05/100.
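To make the 0.05/100 rule concrete, here is a minimal Python sketch of the Bonferroni decision rule. The helper name `bonferroni_reject` and the example P values are made up for illustration; the familywise figure additionally assumes the 100 tests are independent, which pairwise tests on one dataset generally are not (Bonferroni itself does not need that assumption).

```python
alpha = 0.05
m = 100  # number of pairwise Wilcoxon tests in the example above

# Bonferroni: each individual test is judged against alpha / m.
per_test_alpha = alpha / m

# Chance of at least one false positive across all m tests at that stricter
# threshold, ASSUMING independent tests and all nulls true.
fwer_bonferroni = 1 - (1 - per_test_alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Return True for each P value that survives the Bonferroni threshold."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


print(f"per-test threshold: {per_test_alpha}")
print(f"familywise error rate under Bonferroni: {fwer_bonferroni:.4f}")
# Three hypothetical P values, so the threshold here is 0.05 / 3.
print(bonferroni_reject([0.0001, 0.003, 0.04]))
```

The threshold depends only on how many tests are in the family that was actually analysed, which is why the choice of family matters so much.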

Let's assume we don't have a PC and want to calculate all the Wilcoxon P values manually. My question is: how does nature know at the very beginning of our tests that "there are 100 pairwise tests (on a single dataset) to come, so it should increase the false positive error rate for us by 100x"? Does it know this fact after we have run the 100th test? Or does it know it from the start? What happens to the false positive rate if we get tired and stop calculating P values after 20 tests? If nature has decided to give us 100x the type I error (once we decided to run 100 tests), and we stop after the 20th test, then nature gets fooled! Otherwise, if nature works out that we are running 100 tests by counting them, then it won't be fooled if we stop in the middle of the process. But another problem emerges: at the first test, it would think we have only one test, so it would not give us a higher chance of type I error; at the second test, it would say "OK, this dataset has been tested twice, so I will double the rate of type I error for the researcher"; and at the third test it would increase the error rate further. If so, the order of the tests becomes important, and the tests would not share a uniform type I error probability.

OK, the computer too calculates all these multiple comparisons serially. There is no difference between our slow method and its fast method. So how can nature understand that SPSS is going to run 1000 repeated comparisons, when only the first test has been performed and 999 other tests remain to come within a microsecond? (A microsecond for SPSS or hours for us, no matter how we feel about these time measures; the point is that multiple comparisons are always done serially and over time.)
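One way to probe the "how does nature know" puzzle is a small simulation. The Python sketch below rests on simplifying assumptions that are not in the posts above: every null hypothesis is true, the tests are independent, and each P value is therefore uniform on (0, 1). The function name and parameters are made up for illustration.

```python
import random


def simulate_serial_tests(n_runs=20000, n_tests=100, stop_at=20,
                          alpha=0.05, seed=0):
    """Run n_tests null tests one after another, n_runs times over."""
    rng = random.Random(seed)
    hits_at_position = [0] * n_tests  # false positives at each test position
    any_hit_by_stop = 0               # runs with a false positive in the first stop_at tests
    any_hit_by_end = 0                # runs with a false positive anywhere
    for _ in range(n_runs):
        first_hit = None
        for i in range(n_tests):
            # Under a true null, P < alpha happens with probability alpha.
            if rng.random() < alpha:
                hits_at_position[i] += 1
                if first_hit is None:
                    first_hit = i
        if first_hit is not None:
            any_hit_by_end += 1
            if first_hit < stop_at:
                any_hit_by_stop += 1
    rates = [h / n_runs for h in hits_at_position]
    return rates, any_hit_by_stop / n_runs, any_hit_by_end / n_runs


rates, by_20, by_100 = simulate_serial_tests()
print(f"false positive rate at test 1:   {rates[0]:.3f}")
print(f"false positive rate at test 100: {rates[99]:.3f}")
print(f"at least one false positive in the first 20 tests: {by_20:.3f}")
print(f"at least one false positive in all 100 tests:      {by_100:.3f}")
```

Under these assumptions, each individual test keeps roughly the same 5% false positive rate whatever its position in the sequence. What grows is the chance that the collection of tests actually run contains at least one false positive, so stopping after 20 tests simply leaves a smaller collection.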

When digging into details, the problem of multiple comparison appears more and more confusing and somehow ridiculous to me.


Fortran must die
The points you raise, which I think are important ones, are beyond my expertise. I am confused by these types of issues as well. That is why I suggest Tukey's HSD - the software calculates this for you and the test is well accepted. So it apparently deals adequately with your issues (whatever the correct answers are).
Thanks noetsi. The Tukey has its own limitations as well. A Tukey can fix the problem of multiple comparison within each ANOVA setup; but if there are other tests analyzing the same setup (for example other ANOVAs, or other types of tests), the Tukey would ignore them all, while according to the rule, all of those "other" tests are also involved in that multiple comparison problem.

If I am on the right track, it appears to me this multiple comparison thing is more of a cliche (maybe even a wrong one) requested by journals, rather than something really scientific.
I have set the level of significance at 0.01 or sometimes 0.001 in many of my articles to address the multiple comparison problem, since most of the studies have at least 4 or 5 separate tests on the same data. However, the reviewers have criticized this as definitely wrong and were surprised, as if they had seen an alien! This is another source of doubt about this multiple comparison issue.


Ambassador to the humans
Just arbitrarily lowering the alpha probably isn't the best route. You could probably at least just justify it by using Bonferroni. But you should at least mention that.
Thanks Dason. I use the Bonferroni to lower the alpha (not arbitrarily), but yes, they don't know that, and it should be mentioned. Hope it works this time.
:D In my field it is not uncommon! I have had many reviews telling me "why you have set the alpha at 0.001, just make it correct like other studies where it is 0.05 or 0.01"! This review was from one of the most prestigious journals of orthodontics! Only some of them recruit real statisticians, but the external reviewers usually suck at stats.

If I say "alpha is set at for example 0.008" [0.05/6] (not a rounded value such as 0.01) I fear my paper gets rejected in the first place due to confusion of the reviewer!, or even if not, I would certainly get harsh comments on that.


New Member
This is an interesting problem that I have thought about recently. I'm no statistical expert so please take my comments with a pinch of salt.

The problem of multiple comparisons runs along these lines: Let's say I will reject a coin as unfair if I toss it 5 times and I get 5 heads in a row. If I do this with one coin I will reject a fair coin with p=0.03125. But if I test 100 coins in sequence and I don't care which one I reject, I would expect to reject at least one fair coin with p = 0.958. In other words, about 96% of the time I will reject at least one fair coin with this method.

Therefore I should be careful of labelling this coin as unfair because it was determined unfair on the basis of multiple comparisons each of which had p=0.03125 but overall had p=0.958 of rejecting at least one coin.
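The two figures in the coin example can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the 100 coins are all fair and tossed independently.

```python
# Rejection rule from the example: 5 heads in 5 tosses.
p_single = 0.5 ** 5  # chance that one fair coin is rejected

n_coins = 100
# Chance that at least one of 100 independent fair coins is rejected.
p_any = 1 - (1 - p_single) ** n_coins

print(f"one coin: {p_single:.5f}")
print(f"at least one of {n_coins} coins: {p_any:.3f}")
```

This reproduces the p=0.03125 and p=0.958 quoted above.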

In statistics, this is a problem when someone has a large dataset with many variables. If, for example, you had 100 variables and wanted to see which of them had a predictive effect on height, and you ran an individual comparison with each of the 100 variables, you would find at least one association with P<0.05 much more than 5% of the time. Therefore, it is important to perform only comparisons that have some kind of scientific merit, in other words those that you might have expected to be associated with height before doing the tests. The more unnecessary comparisons you perform, the greater the chance of a type 1 error.
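The height example can be tried out on made-up data. The Python sketch below generates 100 pure-noise variables per study and correlates each with height, so every "finding" is a false positive by construction. The sample size of 50, the height distribution, and the use of a Fisher z normal approximation for the P value are all assumptions chosen to keep the sketch dependency-free; they are not from the post.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided P value via the Fisher z-transform (a normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_studies=200, n_subjects=50, n_vars=100, alpha=0.05, seed=1):
    rng = random.Random(seed)
    studies_with_hit = 0
    total_hits = 0
    for _ in range(n_studies):
        height = [rng.gauss(170, 10) for _ in range(n_subjects)]
        hits = 0
        for _ in range(n_vars):
            # Each variable is noise, unrelated to height by construction.
            noise = [rng.gauss(0, 1) for _ in range(n_subjects)]
            if approx_p_value(pearson_r(noise, height), n_subjects) < alpha:
                hits += 1
        total_hits += hits
        if hits > 0:
            studies_with_hit += 1
    return studies_with_hit / n_studies, total_hits / n_studies


frac_any, mean_hits = simulate()
print(f"studies with at least one P<0.05: {frac_any:.3f}")
print(f"average number of 'significant' variables per study: {mean_hits:.2f}")
```

In this setup nearly every simulated study reports at least one "significant" predictor, with around five per study on average, even though none of the variables is related to height.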

This is the same problem that is encountered if you do multiple blood tests for a patient without due indication - you are likely to get spurious results. That is why a doctor has to come to some sort of differential diagnosis before ordering blood tests.

On the other hand, it is not always necessary to perform a Bonferroni (or other correction) to negate this effect. By doing so you may increase the chance of type 2 error and simply not find associations where one may well exist.

Therefore, it is more important simply to bear in mind that you might get a spurious correlation with multiple tests. If, however, despite this increased risk of type 1 error, you feel the correlation you have found has some kind of merit, you can then test this hypothesis by running an experiment properly designed (and adequately powered) to test this single hypothesis, to confirm whether the association is real or arose by chance. In my coin-flipping example, this would mean subjecting the single rejected coin to a second test, in which it is confirmed as unfair only if it comes up with 10 heads in a row. This would reduce the chance of rejecting the coin if it was fair to p=0.0010, despite the p=0.958 of the original multiple comparison.

Two good papers on this are given below:

"Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials"


"What’s wrong with Bonferroni adjustments"


I hope that this helps....
Dear SiBorg77, thanks a lotttt for your kind, extensive comment. I really appreciate the comments and the links. However, like yourself, I already know these basics and a little more. The main question asked here in the first post and some later posts (the answer to which is not available on the net, AFAIK) was a little more elusive than topics like the definitions of P, power and alpha, P values randomly becoming significant in multiple testing, or Bonferroni's and other multiple testing correction methods, which reduce type I error at the cost of reducing power, etc. :)

Dear Dason, thanks for the site. Enjoyed. But plz check my first post.


Super Moderator
I wonder if it'd be useful to reframe this problem a little. We can make a rough reasoning sequence to justify using corrections for multiple comparisons as follows:

  1. The probability of a Type I error, given that the null is true, is equal to alpha (usually, .05) for a single test
  2. However, researchers may often perform multiple tests, and tend to be rewarded for significant findings.
  3. But, when testing multiple null hypotheses, all of which are true, the probability of at least one Type I error becomes large (at most alpha*number of tests; for independent tests it is 1 - (1 - alpha)^number of tests)
  4. So the idea comes about that rather than being happy with a 5% chance of a Type I error in a single test, given that the null is true, no more than a 5% chance of any Type I error (given that all tested nulls are true), is acceptable within a particular set of tests
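Step 3 can be made concrete with a short Python sketch comparing two quantities: the familywise error rate for independent tests, 1 - (1 - alpha)^m, and the simpler alpha*m figure, which works as an upper bound (capped at 1, since it is a probability). Independence of the tests and all nulls being true are assumptions of the sketch; the function names are made up for illustration.

```python
alpha = 0.05


def fwer_independent(m, alpha=alpha):
    """P(at least one Type I error) for m independent tests, all nulls true."""
    return 1 - (1 - alpha) ** m


def fwer_upper_bound(m, alpha=alpha):
    """The alpha*m bound, capped at 1; it holds without assuming independence."""
    return min(1.0, m * alpha)


for m in (1, 2, 5, 10, 20, 100):
    print(f"m={m:>3}  independent tests: {fwer_independent(m):.3f}"
          f"  bound: {fwer_upper_bound(m):.3f}")
```

For a handful of tests the two numbers are close, which is why alpha*m is a handy rule of thumb; for many tests the bound saturates at 1 while the independent-test value approaches 1 more slowly.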

The major problem here, as you've alluded to, Victor, is that there seems to be no satisfactory way to decide what that set of tests (the "family") should be defined as. Other problems include the fact that Type II errors actually matter too, and the fact that the null hypotheses we're so worried about are typically implausible anyway. How plausible is it that any complete set of null hypotheses are all true, especially when the nulls typically specify that particular parameters such as mean differences are exactly zero?

I half-jokingly referred to Bayes' theorem earlier, but I really think that this is one of many issues where NHST has major problems and where a Bayesian approach might be better. In a Bayesian approach, we can actually calculate the probability that a particular hypothesis is true (not just the probability of observing some data given that the hypothesis is true), and this probability does not depend on whether or not we happen to have run some other tests too.