Hello,
For using parametric tests, is it required that the sample data be normally distributed, or is it sufficient to know from other, similar experiments that the population data are normally distributed?
In samples, which are obviously smaller than the population, a few extreme values may spoil the appearance of normality even though the population as a whole still has a normal distribution.
It depends on what you want to do. The t-test, for example, is quite robust against deviations from normality, and for an ANOVA only the residuals have to be normal, etc.
regards
I have already heard that several times about the t-test. I just don't know what exactly "quite robust" means. How do I know when the deviation is OK and when it is too much? Is there a rule of thumb for this?
Actually, t-tests (standard paired/unpaired, and unpaired with the Welch correction) and ANOVA (mainly one-way), as well as (post-hoc) multiple comparison tests, are what I am interested in.
regards
I believe that with ANOVA, the distributions of the Y values need to each be normal with a common variance, rather than the error term (ANOVA being a generalization of the t-test). The error term needs a normal distribution in the context of ordinary least squares regression. The part that confuses me a bit is that a simple regression using only a qualitative variable is equivalent to an ANOVA with that same variable as a factor. However, I think the context and the goal of the inference can maybe help with this "discrepancy". At one point I believe I heard that a normal distribution of errors implies a normal distribution of Y values, but I could be mistaken; I had always learned ANOVA assumptions with respect to the DV and regression assumptions with respect to the error term.
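The equivalence mentioned above is easy to check numerically. Here is a minimal sketch (using simulated data and assuming numpy and scipy are available) comparing the F statistic from a one-way ANOVA with the overall F from an OLS regression on group indicator (dummy) variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated data: three groups of 20 with different means
y = np.concatenate([rng.normal(m, 1.0, 20) for m in (0.0, 0.5, 1.0)])
g = np.repeat([0, 1, 2], 20)

# One-way ANOVA
f_anova, p_anova = stats.f_oneway(y[g == 0], y[g == 1], y[g == 2])

# Same model fitted as OLS regression on dummy variables
X = np.column_stack([np.ones_like(y), g == 1, g == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sse = resid @ resid                      # error sum of squares
sst = ((y - y.mean()) ** 2).sum()        # total sum of squares
df_model, df_error = 2, len(y) - 3
f_reg = ((sst - sse) / df_model) / (sse / df_error)

print(f_anova, f_reg)  # the two F statistics coincide
```

The two F statistics agree because both compare the same full model (group means) to the same reduced model (grand mean).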
For parametric tests it is not the unconditional data that should be sampled from a normally distributed population, but the residuals of the model (e.g. from a regression equation, or from an ANOVA model) that should be a sample from a normally distributed population.
But even this assumption is needed only for small samples. If n > 30, then by the central limit theorem the sampling distribution of the test statistic is approximately normal, even if the residuals come from a non-normal population.
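This robustness can be checked by simulation. The following sketch (assuming numpy and scipy; the distribution and sample size are arbitrary choices for illustration) estimates the type I error rate of a one-sample t-test when the data come from a markedly skewed exponential population with n = 40:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_sims, alpha = 40, 5000, 0.05
rejections = 0
for _ in range(n_sims):
    # Skewed population: exponential with true mean = 1
    sample = rng.exponential(scale=1.0, size=n)
    t, p = stats.ttest_1samp(sample, popmean=1.0)
    rejections += p < alpha

rate = rejections / n_sims
print(rate)  # typically close to the nominal 0.05
```

If the rejection rate stays near 0.05 despite the skewed population, the test is behaving as the CLT argument predicts.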
HTH
Karabiner
»Now the Führer can kiss my arse.« (Ernst Kuzorra, 1941)
The following is a very clear and understandable explanation: Checking the Normality Assumption for an ANOVA Model
Glad to see I wasn't far off, and the explanation helps link them! It seems like one of those obvious things once you think about the assumption and how it plays out: essentially exactly what is done in that article. Thank you. The one bone I pick with the explanation is that errors and residuals are not the same thing, at least as I was taught in my stats courses and in the statistics books I have read. The error is an unobservable, theoretical quantity, while the residual is an observable sample estimate of the error of prediction. Again, that's how I was taught by a couple of statisticians and in different books, although it's a somewhat smaller point. What are your thoughts on that?
It's helpful to keep in mind that "small" and 30 are not exactly hard lines; much larger samples may be needed for data drawn from increasingly non-normal distributions. Thirty may be more than plenty for a slightly non-normal distribution, but 10,000 or more may be required for something from a multimodal, heavily skewed distribution.
I think the confusion arises because you have both residuals and errors. See the attached images. You have a table of residuals, which are the differences between the observed and predicted values. But you also have the ANOVA table with an error sum of squares that aggregates the squared residuals. The two are definitely related, but as you said, they are different.
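The relationship between the residual table and the error line of the ANOVA table can be sketched in a few lines (the group data below are made up for illustration; only numpy is assumed):

```python
import numpy as np

# Hypothetical data: three groups of four observations
groups = {
    "A": np.array([4.1, 5.0, 4.6, 5.3]),
    "B": np.array([6.2, 5.8, 6.5, 6.1]),
    "C": np.array([5.0, 4.4, 4.9, 5.2]),
}

# Residuals: each observation minus its group mean (the fitted value)
residuals = np.concatenate([y - y.mean() for y in groups.values()])

# The ANOVA table's error (within-group) line aggregates them:
sse = np.sum(residuals**2)                               # ≈ 1.4075
df_error = sum(len(y) for y in groups.values()) - len(groups)  # 9
mse = sse / df_error
print(sse, df_error, mse)
```

So the error term on the ANOVA table is not a new quantity; it is the residuals squared and summed, with a degrees-of-freedom divisor to produce the mean square.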
They're saying there is nothing magical about n = 30, and that depending on the characteristics of the population you may need larger sample sizes to get the distribution of the test statistic to be approximately normal.
I don't have emotions and sometimes that makes me very sad.
No, there is nothing magical about 30. I have not yet seen any simulation that required more than n = 30 to 40 or so in order to deliver approximately normally distributed test statistics, even with markedly non-normal distributions of the residuals, e.g. uniform, extremely skewed, or bimodal. But anyway, my problem is the "10,000 or more" notion, which really is surprising (at least for me).
With kind regards
Karabiner
Yeah, that might be a bit extreme... or it may not, depending on the context you're talking about. Extremely rare events modeled using logistic regression? We need very large sample sizes for that to work well and for any inferences based on normal theory to be valid.
Dason explained what I meant; the 10,000 number was more of an illustrative figure. Although, in the case of data drawn from a Cauchy distribution, there is no sample size large enough to get a roughly normal sampling distribution, as far as I am aware. It is a good example of a distribution to which the CLT does not apply, since it has no finite mean or variance.
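The Cauchy case is easy to demonstrate by simulation. The mean of n standard Cauchy draws is itself standard Cauchy, so the spread of the sample mean never shrinks as n grows. The sketch below (numpy assumed; sample sizes chosen arbitrarily) tracks the interquartile range of simulated sample means:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 2000

# For CLT-compliant distributions this IQR shrinks like 1/sqrt(n);
# for the Cauchy it stays near 2 (the IQR of a standard Cauchy)
# at every sample size.
iqrs = []
for n in (10, 100, 1000):
    means = rng.standard_cauchy((n_sims, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    iqrs.append(q3 - q1)
    print(n, q3 - q1)
```

The IQR is used rather than the standard deviation because the Cauchy's sample standard deviation is itself unstable.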
Hard cases make bad law...