# When is data 'nearly normal' - enough to break parametric rules

#### statsnooby

##### New Member
Hi everyone,

I am a student who has carried out a renewable energy experiment for a dissertation project. My dependent variable is total biogas production. I designed an experiment to hopefully look at the influences on biogas production of pre-treatment (Yes or No), Enzyme (Yes or No) and sample volume (100, 200, 300 and 1000ml). I have data for a fully factorial experiment of 2x2x4. Each test was carried out in triplicate so my overall size is N=48.

I have been using Shapiro-Wilk tests to examine normality and tried transforming my data a fair few ways to get a normal distribution, but failed in some cases. When normality was met I used an ANOVA, which was fine. For non-normal data comparisons I was able to use the Kruskal-Wallis test, then multiple Mann-Whitney U tests as post-hoc tests.

My complication is that I am also interested in all the interaction effects. I have been taught that when normality tests or Levene's tests are significant, one should go to non-parametric tests. However, during my own reading here and in other places, people seem happy to break this rule and use parametric tests anyway. I have read in more than one place that an ANOVA and multiple comparisons can be used when mathematical normality tests fail but “the distribution of each group is close to normal”. Sorry if this is a silly question, but I don’t know what is satisfactory in terms of close to normal?

Also, my supervisor says that interactions can’t be tested when all comparable groups don’t have a normal distribution? After doing a web search I think that a generalized linear model may be able to tell me about the significance and effect size of any interactions, if the data can be fitted to a particular skew such as the Poisson shape.

I (crudely) studied the interaction between enzyme and pre-treatment by using a Mann-Whitney U test to study Pre-treatment when Enzyme-Yes and Pre-treatment when Enzyme-No (and vice versa). However, I believe this will only tell me something useful if one factor causes the other to become no longer significant, newly significant, or significant in the opposite direction. I.e. I would like an approach that would show whether a significant increase caused by enzyme is significantly increased further when using pre-treatment.

In summary, my questions are: where can I learn (or please can someone explain) how to know when it is ok to break the normality ‘rules’ and proceed with an ANOVA?

And, hypothetically, in a case when all data comparisons are non-normally distributed, can interactions be tested? And if so, is a three-way comparison a bad idea, since for each case here the sample size would be n=3 (my triplicate of each condition)?

I don’t expect anyone to do my home work but any pointers would be greatly appreciated.

Thank you already as this forum has given me much of this basic knowledge.

#### Dason

How exactly are you assessing the normality of the data within a group if you only have 3 observations?

Note that the assumptions for ANOVA/regression state that the error term is normally distributed. Typically what this means is that we run the analysis and then look at the residuals to assess that assumption.

#### statsnooby

##### New Member
Hi Dason,

I tested normality and homogeneity of the data using the Shapiro-Wilk and Levene's tests in SPSS on the raw data (and transformed raw data), not the residual errors. I used them to check the normality of Enzyme (n=24) vs No-Enzyme (n=24), pre-treatment (n=24) vs No-pre-treatment (n=24), and in the same way each volume level (n=12 each).

To test the influence of pre-treatment or volume in enzyme for example I repeated the above with N for pre-treatment and volume being 12 and 6 respectively. I haven’t gone any further as I thought that studying when N=3 was too small to look at the 3-way interaction (which I am taking from your reply is the case?). I was also worried that I was doing it all incorrectly. I am hoping that with the lowest group size of N=6, I have enough data points to study two-way interactions.

After your response and reading some more: am I correct in thinking that the Shapiro-Wilk test on the raw data or transformed raw data is irrelevant?

Is my answer instead to learn how to save all the residual errors and run the normality tests on those instead?

I have attached my data if it helps to answer my questions about approach

Many thanks,

#### GretaGarbo

##### Human
It has been said a million times before, but here it goes again.....

"It is not the data itself, you see, it is the residuals that are supposed to be normally distributed....."

Or to say the same thing again, it is not the dependent variable Y itself, but rather the dependent variable conditional on the experimental variables, Y|x, that is supposed to be normal. Or rather that Y|x has a known reference distribution, like the Poisson distribution or the gamma distribution (or other distributions in the exponential family). Those distributions can be used in a generalized linear model.

So yes, it is irrelevant to use Shapiro-Wilk test on the raw data, since they, the raw data, are NOT assumed to be normally distributed. Assume that enzyme has a huge effect on the dependent variable. Of course the raw data will have two "bumps" (i.e. be bimodal) and thus not be normal. (But the anova model would still be appropriate since it is the residuals....blah, blah...)
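The two "bumps" argument can be checked with a small simulation. This sketch assumes a single hypothetical factor (enzyme) with a very large effect and normal errors within each group; all numbers are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two groups of 24 whose means differ by 10 standard deviations (made-up values).
no_enzyme = rng.normal(20, 1, 24)
enzyme = rng.normal(30, 1, 24)

# Raw data pooled across groups: bimodal by construction.
raw = np.concatenate([no_enzyme, enzyme])

# Residuals: each observation minus its own group mean.
resid = np.concatenate([no_enzyme - no_enzyme.mean(), enzyme - enzyme.mean()])

w_raw, p_raw = stats.shapiro(raw)
w_res, p_res = stats.shapiro(resid)
print(f"raw data:  W = {w_raw:.3f}, p = {p_raw:.2g}")
print(f"residuals: W = {w_res:.3f}, p = {p_res:.2g}")
```

The raw data fail the test because of the treatment effect itself, even though the errors were generated as exactly normal.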

It is even quite irrelevant to do a Shapiro-Wilk test on the residuals. Why? The Shapiro-Wilk test, and other similar tests, are designed to have high power and detect the slightest non-normality. But analysis of variance (anova) and t-tests are fairly robust tests even if the residuals deviate somewhat from normality. So if a Shapiro-Wilk test detects a moderate non-normality that the anova is robust to, then what is the use of doing that test? I would rather do a histogram of the residuals and a QQ plot of the residuals. (If the points on a QQ plot fall on a roughly straight line, then it is good-enough-normally-distributed.)
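A QQ plot is just the sorted residuals plotted against the quantiles a normal sample of the same size would be expected to have. The sketch below computes those points with `scipy.stats.probplot`; the residuals are a simulated stand-in, not the poster's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Stand-in for the 48 residuals from a fitted anova model (simulated here).
residuals = rng.normal(0, 3, 48)

# probplot returns the QQ points and a least-squares line through them.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

# r is the correlation between the two axes; close to 1 means "roughly straight".
print(f"QQ line: slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.3f}")

# To draw it, pass a matplotlib Axes (or pyplot itself) as the `plot` argument:
#   stats.probplot(residuals, dist="norm", plot=ax)
```

The judgement is still made by eye. The correlation `r` is only a rough summary and no cut-off for it is implied here.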

Anova is robust to non-normality and non-constant residual variance (particularly in a balanced design like this one), but it is certainly not robust to outliers. That always needs to be checked.

> I have been taught that when normality tests or levenes tests are significant go to non-parametric tests.

I don't agree with those who have taught you.

> For non-normal data comparisons I was able to use the kruskal-wallis, then multiple Mann-whitney U for post-hoc tests.

Suppose there are three factors with large, but not huge, effects. Kruskal-Wallis and Mann-Whitney test the response variable against ONE explanatory variable. The influence of the other two factors, which are not included, might cause so much variation that the Kruskal-Wallis test does not detect the influence of the first, included factor. So maybe no factor would be identified as significant. Such a procedure might be very harmful.
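The loss of power described here can be illustrated by simulation. In this sketch the effect sizes and noise level are invented, the design is the 2x2x4 layout in triplicate, and the three-way anova F-test for pre-treatment is computed by hand (valid because the design is balanced) and compared with a Kruskal-Wallis test that ignores the other two factors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Made-up effects, indexed as [pretreat, enzyme, volume, replicate].
pre_eff = np.array([0.0, 2.0])[:, None, None, None]
enz_eff = np.array([0.0, 6.0])[None, :, None, None]
vol_eff = np.array([0.0, 4.0, 8.0, 12.0])[None, None, :, None]
cell_mean = pre_eff + enz_eff + vol_eff  # shape (2, 2, 4, 1)

sigma, n_sims, alpha = 2.0, 500, 0.05
hits_anova = 0
hits_kw = 0

for _ in range(n_sims):
    y = cell_mean + rng.normal(0, sigma, (2, 2, 4, 3))

    # Full three-way model: residual SS is the variation around the 16 cell means.
    ss_resid = ((y - y.mean(axis=3, keepdims=True)) ** 2).sum()  # 48 - 16 = 32 df
    # Main-effect SS for pre-treatment: 24 observations per level, 1 df.
    ss_pre = 24 * ((y.mean(axis=(1, 2, 3)) - y.mean()) ** 2).sum()
    f_stat = ss_pre / (ss_resid / 32)
    hits_anova += int(stats.f.sf(f_stat, 1, 32) < alpha)

    # Kruskal-Wallis on pre-treatment alone, lumping enzyme and volume into the "noise".
    hits_kw += int(stats.kruskal(y[0].ravel(), y[1].ravel()).pvalue < alpha)

power_anova = hits_anova / n_sims
power_kw = hits_kw / n_sims
print(f"three-way anova detects pre-treatment in {power_anova:.0%} of runs")
print(f"one-factor Kruskal-Wallis detects it in {power_kw:.0%} of runs")
```

The gap comes from what counts as error: the anova removes the enzyme and volume effects before judging pre-treatment, while the one-factor test has to see the pre-treatment shift through all of that extra spread.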

I looked at your data and I would not be worried about non-normality of the residuals. (By the way: all the main effects, the two factor interactions and the three factor interaction were statistically significant.)
Do the 3-way anova.

I think the best would be to draw the interaction plot with all effects included (all three factors). Then I think everybody will understand what it is about. (Including your supervisor!) I suggest drawing small "error lines" from each mean, the size of one standard error (the pooled standard deviation divided by sqrt(3), since 3 is the sample size in each cell).
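The numbers behind such a plot are the 16 cell means and one pooled standard error. A minimal sketch, using simulated data in an array indexed as `y[pretreat, enzyme, volume, replicate]` (a layout assumed here for convenience):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated stand-in for the 2x2x4 design with 3 replicates per cell.
enzyme_shift = np.array([0.0, 15.0])[None, :, None, None]  # made-up enzyme effect
y = 50 + enzyme_shift + rng.normal(0, 3, (2, 2, 4, 3))

cell_means = y.mean(axis=3)          # one mean per cell, shape (2, 2, 4)
cell_vars = y.var(axis=3, ddof=1)    # within-cell sample variances

# With equal cell sizes, averaging the cell variances gives the pooled variance
# (the anova residual mean square, on 32 degrees of freedom).
pooled_sd = np.sqrt(cell_vars.mean())
se_mean = pooled_sd / np.sqrt(3)     # standard error of each cell mean

volumes = (100, 200, 300, 1000)
for p in range(2):
    for e in range(2):
        for v, vol in enumerate(volumes):
            print(f"pretreat={p} enzyme={e} volume={vol}: "
                  f"{cell_means[p, e, v]:.1f} +/- {se_mean:.1f}")
```

Plotting `cell_means` against volume, with one line per pre-treatment and enzyme combination and bars of length `se_mean`, gives the interaction plot. Using the pooled value means every bar has the same length, which is steadier than 16 separate standard deviations each based on only 3 points.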

#### noetsi

##### No cake for spunky
Note that the normality test you cite has notoriously weak power to start with, and you have few cases, which makes it worse. Since the null is normality it is possible you will reject the normality even if it in fact exists. A QQ plot is probably a better way to determine normality (even though it is a graphical procedure) in nearly all cases, and even more so with few cases.

> I have read in more than one place that an ANOVA and multiple comparisons can be used when mathematical normality tests fail but “the distribution of each group is close to normal”. Sorry if this is a silly question but I don’t know what is satisfactory in terms of close to normal?

Because of the central limit theorem ANOVA is considered highly robust to the normality assumption. What that means in practice and what "close to normal" means in practice is one of those questions (like what a "large sample" is) that is simply not decided on or agreed on among statisticians. And I have spent years looking for that answer in texts and articles.
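One way to get a feel for "robust" in a specific setting is to simulate it. The sketch below checks a single case only: a one-way ANOVA with 4 groups of 12, no true group differences, and strongly skewed (exponential) errors. The choice of distribution and group sizes is an assumption made for illustration and says nothing about other situations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, alpha = 2000, 0.05

rejections = 0
for _ in range(n_sims):
    # Four groups of 12 drawn from the same skewed distribution, so the null is true.
    groups = rng.exponential(1.0, (4, 12))
    rejections += int(stats.f_oneway(*groups).pvalue < alpha)

type1_rate = rejections / n_sims
print(f"F-test rejected a true null in {type1_rate:.1%} of runs (nominal 5%)")
```

If the printed rate is close to 5%, the F-test's false-positive rate is holding up in this particular scenario despite the skew. Power, unequal variances and outliers would each need their own check.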

#### statsnooby

##### New Member
Thanks so much I don't think I could have asked for a more specific answer! It is so much clearer in my head now.

#### GretaGarbo

##### Human
Of course I always agree with Noetsi, but it happens that I don't understand what he means.

> Note that the normality test you cite has notoriously weak power to start with and you have few cases which will be worse. Since the null is normality it is possible you will reject the normality even if it in fact exists.

If the Shapiro-Wilk test has low power, that means it has a low probability of rejecting normality when the data are in fact non-normal.

If the data are in fact normal, the null hypothesis is true, and if the test is correctly constructed its error rate will only be the significance level, which in most cases is 5%. So I don't believe that the chance "[that] you will reject the normality even if it in fact exists" is very large. I believe that it is only 5%.

But what is the power of the Shapiro-Wilk test? Wikipedia links to a study by Razali and Wah (2011).

The Shapiro-Wilk test seems to be the most powerful test among the ones considered. And a sample size of around 48 is not that small. A rule of thumb is that the sample is large if it is larger than 30. And Noetsi loves rules of thumb. The Shapiro-Wilk test seems to have power around 50% for various deviations from normality at a sample size of around 50.
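Both quantities in this argument, the false rejection rate when the data really are normal and the power when they are not, can be estimated by simulation. This is a rough sketch at n = 50; the two non-normal alternatives (a t distribution with 5 degrees of freedom and an exponential) are chosen here for illustration and are not taken from the cited study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, n_sims, alpha = 50, 2000, 0.05


def rejection_rate(sampler):
    """Fraction of simulated samples of size n for which Shapiro-Wilk gives p < alpha."""
    rejections = sum(
        int(stats.shapiro(sampler(n)).pvalue < alpha) for _ in range(n_sims)
    )
    return rejections / n_sims


# Null true: this estimates the type I error rate, which should sit near alpha.
rate_normal = rejection_rate(lambda size: rng.normal(size=size))

# Null false: these estimate power against a heavy-tailed and a skewed alternative.
rate_t5 = rejection_rate(lambda size: rng.standard_t(5, size=size))
rate_expon = rejection_rate(lambda size: rng.exponential(size=size))

print(f"normal data rejected:      {rate_normal:.1%}")
print(f"t(5) data rejected:        {rate_t5:.1%}")
print(f"exponential data rejected: {rate_expon:.1%}")
```

Power depends heavily on what kind of departure from normality is present, so a single figure for "the power of the test" only makes sense relative to a stated alternative.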

Ref:
Razali, Nornadiah; Wah, Yap Bee (2011). "Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests". Journal of Statistical Modeling and Analytics 2 (1): 21–33

#### noetsi

##### No cake for spunky
GretaGarbo, all the normality tests that I am aware of have low power [or that is what the literature I have seen says]. So being relatively more powerful is not that important.

You are right, and I was wrong, about what low power means in terms of the null. It means that you will assume normality (the null) when you probably should not, the reverse of what the OP wants.

#### Dason

> GretaGarbo all the normality tests that I am aware of have low power [or that is what the literature I have seen says]. So being relatively more powerful is not that important.

I'll give you a deal: You can either give me \$10000, \$12000, or \$13000. You get absolutely nothing out of this deal - you just owe me that money. You do get to choose which of those options to go with though. They're all terrible options but I'm guessing you're going to try to find the best one out of the terrible options.

#### noetsi

##### No cake for spunky
How about a deal that involves dying by firing squad, electric chair or having to hear one of my public admin lectures (when I gave them)? In all three cases you die; does it really matter how, if all you care about is dying? The point is, I think, that when the results are all bad (you get normality wrong in all three), being somewhat less bad won't make a practical difference. Again, the literature I have read says that there is so little confidence in all the normality tests that graphical options like QQ plots are preferable to any of them.

I am well, painfully actually, aware that statisticians commonly disagree. And of course I have read a tiny portion of the total literature.

#### Dason

> How about a deal that involves dying by firing squad, electric chair or having to hear one of my public admin lectures (when I gave them) In all three cases you die, does it really matter how if all you care about is dying? The point is, I think, that when the results are all bad (you get normality wrong in all three) being somewhat less bad won't make practical difference. Again the literature I have read says that there is so little confidence in all the normality tests that it is better to try graphical options like QQ plots to any of them.
>
> I am well, painfully actually, aware that statisticans commonly disagree. And of course I have read a tiny portion of the total literature.

Of course it matters how you die - there is a best and worst option in that group. And what I'm saying is that just because the choices aren't great why would you NOT choose the best option out of the bunch?

#### GretaGarbo

##### Human
I don't know who is going to pay who and for what, but while "you" are doing that, why don't you send some money to me too?

These normality tests are often mentioned, and I wanted to give a link to published facts about the tests.

- - -

Going back to the OP. One can imagine none of the treatment factors having any effect; then the anova is run and nothing is significant.
Alternatively, one can imagine all three factors having an effect: the normality test (on the raw data) is run, normality is rejected (due to a multi-modal distribution, because the treatments are really influential), an alternative non-parametric Kruskal-Wallis test is run, and nothing is detected (significant) because the signal is hidden by the other, non-included factors. So nothing is discovered! What about that power?!

I would not be surprised if this is a common situation.

#### noetsi

##### No cake for spunky
Dason, if you accept the null because of low power in all three cases, why does it matter if some have more (but still inadequate) power? Why not use a better approach than any formal test, commonly graphical ones such as QQ plots?

I would not be surprised if power was an issue for both parametric and non-parametric tests of the main effect when you have small sample sizes (48 cases seems small to me). That is of course a separate issue from power in the normality test.

To return to another key point of the OP: it is difficult for practitioners who are not experts in statistics to know when departures from assumptions such as normality actually matter, and how much they matter. I know I spent a great deal of time, and still do, trying to find literature that deals with this concretely, or with the related issue of robustness. What does it mean in practice, for a specific analysis, if a method is "robust" or a "moderate" departure from normality (or another assumption) is acceptable?

I would guess that no one really knows which is why it rarely is addressed.