Normality assumption in ANOVA

#1
Hi All,

I'm a PhD student in molecular biology and I'm running an experiment to investigate differences in protein levels between four different cell lines. Protein level is measured as the mean intensity of IHC images produced using a particular antibody, with n = 5 images per cell line. I am planning on doing a one-way ANOVA to determine whether there are any statistically significant differences between the cell lines. I know that one of the assumptions of ANOVA is that the experimental errors are normally distributed. I have used the Shapiro-Wilk test and determined that the data is normally distributed within each cell line. However, when I combined all 4 cell lines to produce a large dataset, the data was not normally distributed. Can I proceed with my ANOVA in this case, or should I do a non-parametric test? In other words, does ANOVA assume normality within each particular group, or normality of all data points (coming from all 4 cell lines in my case) combined?

Best Wishes,

pito22
 

Miner

TS Contributor
#2
You said it yourself. ANOVA assumes that the errors (residuals) are normally distributed. The residuals are not the same thing as the raw data.

Also, think this through logically. Take two normal distributions. Shift the mean of one distribution three standard deviations away from the mean of the first distribution. Now combine the data. This will result in a pronounced bimodal distribution. Do a normality test. It will fail. Now add in two more distributions at various distances from the first and combine them. The combined data will again fail a normality test, yet differences in means like these are exactly what an ANOVA was designed to detect. Normality of the combined data is irrelevant.

Normality of the data is often listed as an assumption, but it is not critical. ANOVA is very robust to violations of this assumption, so if the residuals are fairly normal, you are in good shape.
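
A quick simulation makes the point concrete. This is only a sketch with made-up numbers (four groups of 50 values with arbitrary means, not your IHC data), assuming NumPy and SciPy are available. It runs Shapiro-Wilk once on the pooled raw values and once on the residuals, i.e. each value minus its own group mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: four groups, same SD, means several SDs apart.
group_means = [0.0, 3.0, 7.0, 12.0]
groups = [rng.normal(loc=m, scale=1.0, size=50) for m in group_means]

# Pooled raw data: a mixture of four normals, clearly multimodal.
pooled = np.concatenate(groups)
_, p_pooled = stats.shapiro(pooled)

# Residuals: subtract each group's own mean, then pool.
residuals = np.concatenate([g - g.mean() for g in groups])
_, p_resid = stats.shapiro(residuals)

print(f"Shapiro-Wilk p, pooled raw data: {p_pooled:.2g}")
print(f"Shapiro-Wilk p, residuals:       {p_resid:.2g}")
```

The pooled data should be rejected as non-normal simply because the group means differ, while the residuals, which are what the assumption is about, are drawn from a single normal distribution.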
 
#3
Non-normality will not bias the slopes. It only affects the standard errors, and therefore the statistical tests. With a lot of data it matters less because of the central limit theorem.
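
If you want to see how much the tests are affected, a rough simulation along these lines can help. It is a sketch under assumptions of my own choosing (exponential errors as the skewed case, four groups, 2000 repetitions), not a general result: it draws every group from the same skewed distribution, so the null hypothesis is true, and counts how often one-way ANOVA rejects at the 5% level for a small and a large group size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rate(n_per_group, n_sims=2000, k=4, alpha=0.05):
    """Share of simulations in which one-way ANOVA rejects although all group means are equal."""
    rejections = 0
    for _ in range(n_sims):
        # Skewed (exponential) errors, identical distribution in every group.
        samples = [rng.exponential(scale=1.0, size=n_per_group) for _ in range(k)]
        _, p = stats.f_oneway(*samples)
        rejections += int(p < alpha)
    return rejections / n_sims

rate_small = false_positive_rate(5)
rate_large = false_positive_rate(100)
print(f"Rejection rate under a true null, n = 5 per group:   {rate_small:.3f}")
print(f"Rejection rate under a true null, n = 100 per group: {rate_large:.3f}")
```

The closer each rate is to the nominal 0.05, the less the non-normality matters for the test at that sample size.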
 

Karabiner

TS Contributor
#4
"I have used the Shapiro-Wilk test and determined that the data is normally distributed within each cell line."
This is a severe misunderstanding of the test and its result. You tested the null hypothesis that your data (n=5) are sampled from a normal distribution. With such a tiny sample size it is nearly impossible to reject that null hypothesis. But the fact that you did not have enough power to reject the null does not mean that it is true. You could just as well have tested for a binomial distribution and for a Weibull distribution, and I bet those tests would also have come out non-significant, due to very poor statistical power. Can one assume that the variable is normally distributed and binomially distributed and Weibull distributed in the groups, all at the same time?
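
The lack of power is easy to check by simulation. The following is a sketch under my own assumptions (clearly non-normal exponential data, 2000 repetitions, NumPy and SciPy): it estimates how often Shapiro-Wilk rejects normality at the 5% level with n = 5 versus n = 50.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def shapiro_power(n, n_sims=2000, alpha=0.05):
    """Share of exponential samples of size n that Shapiro-Wilk flags as non-normal."""
    hits = 0
    for _ in range(n_sims):
        sample = rng.exponential(scale=1.0, size=n)
        _, p = stats.shapiro(sample)
        hits += int(p < alpha)
    return hits / n_sims

power_n5 = shapiro_power(5)
power_n50 = shapiro_power(50)
print(f"Estimated power with n = 5:  {power_n5:.2f}")
print(f"Estimated power with n = 50: {power_n50:.2f}")
```

With n = 5 the test misses this strongly skewed distribution most of the time, so a non-significant result says very little about whether the data are normal.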
"However, when I combined all 4 cell lines to produce a large dataset, the data was not normally distributed. I was wondering if I could proceed with my ANOVA in this case or should I do a non-parametric test, in other words, does ANOVA assume the normality within each particular group or the normality of all data points (coming from all 4 cell lines in my case) combined?"
With n=20, the analysis of variance might be sensitive to departures of the residuals from normality. A Kruskal-Wallis H test would at least be a safe choice.
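
In case it is useful, here is a minimal sketch of how both tests could be run in Python with SciPy. The intensity values and the names line_a to line_d are invented for illustration; replace them with your own five measurements per cell line.

```python
from scipy import stats

# Hypothetical mean intensities, 5 images per cell line (made-up values).
line_a = [10.2, 11.5, 9.8, 10.9, 11.1]
line_b = [12.4, 13.1, 12.9, 11.8, 13.5]
line_c = [10.0, 10.7, 9.5, 11.2, 10.4]
line_d = [15.3, 14.8, 16.1, 15.0, 14.2]

# Kruskal-Wallis H test: rank-based, no normality assumption.
h_stat, p_kw = stats.kruskal(line_a, line_b, line_c, line_d)

# One-way ANOVA on the same data, for comparison.
f_stat, p_anova = stats.f_oneway(line_a, line_b, line_c, line_d)

print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
print(f"One-way ANOVA:  F = {f_stat:.2f}, p = {p_anova:.4f}")
```

If the two tests lead to the same conclusion, the choice between them matters little for your data.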

With kind regards

Karabiner
 
#5
Tests of normality are not well regarded anyway. I would look at a Q-Q plot instead, but with only 20 cases I would question almost any analysis, both statistically and in terms of generalizability. Is it impossible to get more data?
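
For the Q-Q plot, something like the following sketch should work, assuming NumPy and SciPy. The data are simulated placeholders (4 groups of 5 with made-up means), and the plot is made on the residuals, as suggested earlier in the thread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data: 4 groups x 5 observations, made-up means.
groups = [rng.normal(loc=m, scale=1.0, size=5) for m in (10.0, 12.0, 10.5, 15.0)]

# Residuals: each observation minus its own group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# probplot returns theoretical and ordered sample quantiles plus a least-squares line.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Correlation of the Q-Q points with a straight line: r = {r:.3f}")

# To draw the plot (requires matplotlib):
#   import matplotlib.pyplot as plt
#   stats.probplot(residuals, dist="norm", plot=plt)
#   plt.show()
```

Points that fall close to the straight line suggest the residuals are roughly normal; systematic curvature or a few points far off the line at either end are the things to look for.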