First, sorry for my first-grade English! If you are OK with it, please read on.
Second, sorry that my question is so basic, but sometimes I see very easy questions that are surprisingly and kindly answered by the community, so I'm taking my chance to learn something here.
Can someone tell me what the threshold is for multiple-comparison testing? Based on my experience with some statisticians and on reading medical/dental articles, I first thought that to count as a "multiple comparison" case, a setup should follow the pattern of an ANOVA (or its non-parametric alternatives). Later a friend told me that every design in which more than one test is performed is a case of multiple comparisons, and should be adjusted accordingly.
This leads to a basic but important question: is that true? I have read many articles in which more than one test (sometimes many more) was performed, but without any multiple-comparison correction (beyond what the post-hoc tests themselves correct). In fact, I have only ever seen multiple-comparison corrections applied in ANOVA-like designs.
Let's assume that running more than one test increases the chance of obtaining a type I error. My fundamental question is: within what context should the number of tests be counted? An ANOVA? A study? All studies by one researcher? All studies in a day? All studies ever? And how can someone decide where that boundary lies? My English and mathematics don't let me give a precise scientific explanation, but my common sense still insists that this multiple-comparison business is not fully coherent.

OK, I have read some basic articles with the convincing message that increasing the number of tests does increase the number of p-values < 0.05 obtained by chance. But is the relevant count "the number of tests in a study" (e.g., Bonferroni's correction)? Or "the number of tests in a part of a study" (for example, if we have 10 Friedman tests in a study, should we control the type I error for each Friedman test separately, or across all the possible pairwise comparisons? The difference could be huge)? Or "the number of tests in all studies"? Or in what?

Perhaps we might even consider the whole body of research as one single study attempting to understand the world (the sample being a sophisticated composite of all the small samples). If that is the right frame, then the almost infinite number of statistical tests that have been and are being performed could disrupt everything we are trying to elucidate from all these statistics (I mean that all those p-values < 0.05 could actually be the results of millions of tests, and thus have happened by chance; so we should work with alphas < 10^-1000, for example, rather than 0.05).

My argument may seem pointless, but my question is exactly "how can we decide whether it is pointless or not?" In other words, what is the logic behind choosing the family for a multiple-comparison correction? You might think this is very basic, but I could link to articles published in accredited journals in which the multiple-comparison correction was applied only to some specific tests within a much larger setup; for example, only to the pairwise tests within the one significant Friedman test (out of 6 non-significant Friedman tests, alongside several other significant and non-significant tests such as Spearman correlations and chi-squares).
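In case it helps to make the inflation concrete, here is a minimal simulation sketch (my own illustration in Python, not taken from any of the articles I mention; the choices m = 20 tests and alpha = 0.05 are arbitrary assumptions): with every null hypothesis true, the chance of at least one p < 0.05 grows quickly with the number of tests, and Bonferroni's alpha/m per-test threshold brings the family-wise error rate back down, but only for whatever "family" of m tests you chose to count.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 20           # number of independent tests, all with a true null (assumed)
alpha = 0.05
n_sims = 10_000

# Under a true null, a p-value is uniform on [0, 1].
# Each simulated "study" draws m such p-values and asks:
# did at least one fall below the threshold by chance?
pvals = rng.uniform(size=(n_sims, m))

fwer_uncorrected = np.mean((pvals < alpha).any(axis=1))
fwer_bonferroni = np.mean((pvals < alpha / m).any(axis=1))

print(f"P(>=1 false positive), no correction:  {fwer_uncorrected:.3f}")
print(f"Theoretical value 1 - (1 - alpha)^m:   {1 - (1 - alpha)**m:.3f}")
print(f"With Bonferroni (alpha/m per test):    {fwer_bonferroni:.3f}")
```

Note that the output depends entirely on what you plug in for m: change it from 20 to "all tests ever run" and the per-test threshold collapses toward zero, which is exactly why the choice of family is the whole question.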
Thank you very much for reading, and even more for, well, discussing/replying.