P-Value in overall ANOVA vs. TukeyHSD

Hey Everyone,

I've been learning stats on the side and have some difficulty understanding the different methods of comparing means.

I used R and would have posted in their forum, except this is more theory. When I used the anova function on a data set, I got a low p-value and high F - which I understand is a good thing.

            Df Sum Sq Mean Sq F value   Pr(>F)
X1           4 1594.9   398.7   40.79 2.27e-10 ***
Residuals   24  234.6     9.8

When this worked out, I looked at TukeyHSD to explain the variation between all possible pairs of means. However, I got high p-values for several of the comparisons.

Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = X64.5 ~ X1, data = sugar)

diff lwr upr p adj
Root Borer-Control -21.6500000 -27.227351 -16.072649 0.0000000
Stem Borer-Control -17.8000000 -23.377351 -12.222649 0.0000000
Termites-Control -17.3833333 -22.960685 -11.805982 0.0000000
TopShoot Borer-Control -19.9333333 -25.510685 -14.355982 0.0000000
Stem Borer-Root Borer 3.8500000 -1.467796 9.167796 0.2392782
Termites-Root Borer 4.2666667 -1.051129 9.584462 0.1600002
TopShoot Borer-Root Borer 1.7166667 -3.601129 7.034462 0.8738684
Termites-Stem Borer 0.4166667 -4.901129 5.734462 0.9993230
TopShoot Borer-Stem Borer -2.1333333 -7.451129 3.184462 0.7613845
TopShoot Borer-Termites -2.5500000 -7.867796 2.767796 0.6257874

Several of the p-values are much higher than 0.05. Does this mean I should reject these comparisons? Is this because of the 95% confidence limit? Thanks a ton.


TS Contributor
The confusion lies in several issues.

First, ANOVA is not a means test. It is an effects test. All it tells you is that your treatment (factor) has a significant effect. It does not tell you which specific treatment levels actually caused that effect. That is why you have to use a post-hoc test such as Tukey's HSD to determine which specific treatment levels caused this effect. Tukey's HSD is a means test.

Tukey's HSD has to control the family-wise error rate at your specified alpha level. To do that, it must hold the individual error rate to increasingly tighter levels: the more pairwise combinations you test, the tighter the control. If I read your table correctly, you have 4 pairwise comparisons (all against the Control) that account for the significant ANOVA effect. The remaining comparisons are not statistically significant and do not contribute.
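That tightening can be made concrete with a Bonferroni-style sketch. (Tukey's HSD actually uses the studentized range distribution and is less conservative, but the principle is the same: each individual comparison must clear a stricter threshold so the whole family stays at alpha.) A minimal illustration in Python, using your design of 5 groups:

```python
from math import comb

alpha = 0.05          # desired family-wise error rate
groups = 5            # Control + 4 treatments, as in the Tukey table above
k = comb(groups, 2)   # number of pairwise comparisons among 5 groups

# Bonferroni approximation: test each comparison at alpha / k so that
# the chance of ANY false positive across all k tests stays near alpha.
per_comparison_alpha = alpha / k

print(k)                      # 10 pairwise comparisons
print(per_comparison_alpha)   # 0.005 per comparison
```

So with 10 comparisons, an individual comparison effectively needs a raw p-value near 0.005, not 0.05, which is why "adjusted" p-values come out larger than the unadjusted ones.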


Not a robit
The F-test is an overall test that your model explains more variability in the outcome than chance alone. Your TukeyHSD is a follow-up test, which corrects for multiple comparisons. So if I compare the weight groups normal vs. overweight, normal vs. obese, and normal vs. morbidly obese, it will adjust the alpha level to maintain your Type I error rate, so you don't reject the null hypothesis of no difference just by chance. Big picture: if you compare something to other groups enough times, you can find a spurious difference by chance.
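The "enough times" point has a simple number behind it. If each of k independent tests is run at alpha = 0.05 and every null hypothesis is actually true, the chance of at least one false positive is 1 - (1 - alpha)^k. (Independence is a simplifying assumption here; the actual Tukey comparisons share data and are correlated, but the trend is the same.) A quick sketch:

```python
alpha = 0.05
for k in (1, 5, 10, 20):
    # P(at least one false positive) = 1 - P(no false positives in k tests)
    fwe = 1 - (1 - alpha) ** k
    print(k, round(fwe, 3))   # grows from 0.05 toward near-certainty
```

By 10 comparisons the family-wise error rate is already around 40% if nothing is corrected, which is exactly what the Tukey adjustment guards against.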

How do you move forward? Well, one approach may be to refit your model excluding variables found not to be significant after these pairwise comparisons. Or you can keep them in the model to control for them, although they don't significantly add anything to the model. Big picture: your overall model can be significant even if not all of the included variables are, or if some are shown not significant after correcting for repeated tests.

Well, it looks like Miner, just beat me to posting - so I hope this is helpful and not too redundant!
Just to clarify: I would use ANOVA to say that my factor did have a definite impact on the result, because of the low p-value and high F. I would then conclude, from TukeyHSD, that the control group was significantly different from each treatment, but that the treatments were not significantly different from each other.

What real-world implications can I draw from this? Would it be right to say that I can't be 95% confident that there is a difference in results between the factors?


Not a robit
Yes. Given the correction for multiplicity, and given sampling variability, you would not be 95% confident that there is a true difference between those groups, per the selected a priori alpha level.

I did statistics there. Yay, until someone tells me I wrote this incorrectly I will presume that I am not incompetent!

P.S., My stats to the left are all pretty factors of "5", virtual high-fives to all!