Normal distribution of sample or population?

#1
Hello,

For using parametric tests, is it required that the sample data be normally distributed, or is it sufficient to know from other, similar types of experiments that the population data are normally distributed?
In samples, which are obviously smaller than the population, a few extreme values may spoil the apparent normality, even though the population as a whole can still be normally distributed.
 

rogojel

TS Contributor
#2
It depends on what you want to do - e.g. the t-test is quite robust against deviations from normality. If you do an ANOVA, only the residuals have to be normal, etc.
regards
 
#3
It depends on what you want to do - e.g. the t-test is quite robust against deviations from normality.
I have already heard that several times about the t-test. I just don't know what "quite robust" means exactly. How do I know when the deviation is OK and when it is too much? Is there a rule of thumb for this?

If you do an ANOVA, only the residuals have to be normal .. etc
regards
Actually, t-tests (standard paired/unpaired, unpaired with Welch correction) and ANOVA (mainly one-way), as well as (post-hoc) multiple comparison tests, are what I am interested in.
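A quick way to see what "quite robust" means in practice is to simulate it yourself: draw both samples from the same skewed population (so the null hypothesis is true) and count how often the t-test rejects at the nominal alpha. A minimal sketch; the exponential population, n = 30, and the number of replications are illustrative choices, not anything prescribed in this thread:

```python
# Monte Carlo check of the t-test's Type I error rate under a skewed
# (exponential) population. "Robust" here means the empirical rejection
# rate stays close to the nominal alpha even though the data are non-normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, reps = 0.05, 30, 5000

rejections = 0
for _ in range(reps):
    # Both samples come from the SAME skewed population, so H0 is true
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(f"empirical Type I error: {rejections / reps:.3f}")
```

If the empirical rate lands near 0.05, the test is holding its nominal level despite the skewness; repeating the exercise with more extreme distributions or smaller n shows where the robustness starts to break down.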
 
#4
It depends on what you want to do - e.g. the t-test is quite robust against deviations from normality. If you do an ANOVA, only the residuals have to be normal .. etc
regards
I believe that with ANOVA, the distributions of Y values need to each be normally distributed with a common variance, rather than the error term (as a generalization of a t-test). The error term needs a normal distribution in the context of an ordinary least squares regression. The part that confuses me a bit is that a simple regression using only a qualitative variable is equivalent to an ANOVA with that same independent variable as a factor. However, I think the context and the goal of the inference can maybe help with this "discrepancy". At one point, I believe I heard that a normal distribution of errors implies a normal distribution of Y values, but I could be mistaken; I had always learned ANOVA assumptions with respect to the DV and regression assumptions with respect to the error term.
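The equivalence between one-way ANOVA and dummy-variable regression can be checked numerically: both give the same F statistic, and the model residuals are simply each observation minus its group mean. A small sketch with made-up data (the group means and sizes are arbitrary):

```python
# One-way ANOVA vs. the "regression view": the F statistic computed from
# group-mean residuals matches scipy's f_oneway exactly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, 20)
g2 = rng.normal(0.5, 1.0, 20)
g3 = rng.normal(1.0, 1.0, 20)

# Standard one-way ANOVA
f_anova, _ = stats.f_oneway(g1, g2, g3)

# Regression view: residuals are deviations from the fitted group means
residuals = np.concatenate([g - g.mean() for g in (g1, g2, g3)])
y = np.concatenate([g1, g2, g3])
grand = y.mean()

ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in (g1, g2, g3))
ss_within = (residuals ** 2).sum()
# df_between = 3 groups - 1 = 2; df_within = 60 observations - 3 groups = 57
f_manual = (ss_between / 2) / (ss_within / (len(y) - 3))

print(f"F from f_oneway: {f_anova:.4f}, F by hand: {f_manual:.4f}")
```

This also shows why the two sets of assumptions coincide here: normal Y within each group (with common variance) is the same statement as normal, homoscedastic errors around the group means.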
 

Karabiner

TS Contributor
#5
for using parametric tests, is it required that the sample data are normally distributed or is it sufficient to know from other similar types of experiments that the population data are normally distributed?
For parametric tests, it is not the unconditional data that need to be sampled from a normally distributed population; rather, the residuals of the model (e.g. from a regression equation, or from an ANOVA model) should be a sample from a normally distributed population.

But even this assumption is needed only for small samples. If n > 30, then according to the central limit theorem the test statistics are not (much) affected, even if the residuals are from a non-normal population.
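As a quick illustration (the exponential population and n = 30 are illustrative choices, not anything special): the raw data here have theoretical skewness 2, but means of n = 30 draws are already far closer to symmetric:

```python
# Sketch of the CLT at work: sample means of n = 30 draws from a skewed
# (exponential) population are much closer to symmetric than the raw data.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
n, reps = 30, 20000
draws = rng.exponential(scale=1.0, size=(reps, n))
means = draws.mean(axis=1)

# The exponential has theoretical skewness 2; the skewness of the sample
# means shrinks roughly like 2 / sqrt(n), i.e. about 0.37 for n = 30.
print(f"skewness of raw data:     {skew(draws.ravel()):.2f}")
print(f"skewness of sample means: {skew(means):.2f}")
```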

HTH

Karabiner
 

Miner

TS Contributor
#6
I believe that with ANOVA, the distributions of Y values need to each be normally distributed with a common variance, rather than the error term (as a generalization of a t-test). The error term needs a normal distribution in the context of an ordinary least squares regression. The part that confuses me a bit is that a simple regression using only a qualitative variable is equivalent to an ANOVA with that same independent variable as a factor. However, I think the context and the goal of the inference can maybe help with this "discrepancy". At one point, I believe I heard that a normal distribution of errors implies a normal distribution of Y values, but I could be mistaken; I had always learned ANOVA assumptions with respect to the DV and regression assumptions with respect to the error term.
The following is a very clear and understandable explanation: Checking the Normality Assumption for an ANOVA Model
 
#7
The following is a very clear and understandable explanation: Checking the Normality Assumption for an ANOVA Model
Glad to see I wasn't far off, and the explanation helps link them! It seems like one of those obvious things once you think of the assumption and how it plays out, essentially exactly what is done in that article. Thank you. The one bone I pick with the explanation is that errors and residuals are not the same thing, at least as I was taught in my stats courses and in the statistics books I have read. The error is a theoretical quantity that is unobservable, while the residual is an observable, sample-based estimate of the error of prediction. Again, that's how I was taught by a couple of statisticians and by different books, although it's somewhat of a smaller point. What are your thoughts on that?

But even this assumption is needed only for small samples. If n > 30, then according to the central limit theorem the test statistics are not (much) affected, even if the residuals are from a non-normal population.

HTH

Karabiner
It's helpful to keep in mind that "small" and 30 are not exactly hard lines; much larger samples may be needed for data drawn from increasingly non-normal distributions. Thirty may be more than plenty for a slightly non-normal distribution, but 10,000 or more may be required for something from a multimodal, heavily skewed distribution.
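To make this concrete with a sketch (the lognormal population, its sigma, and the sample sizes are arbitrary illustrative choices): for a heavily skewed population, the distribution of sample means is still clearly skewed at n = 30 and only approaches symmetry at much larger n:

```python
# Sketch: for a heavily skewed population (lognormal with sigma = 1.5),
# sample means at n = 30 remain strongly skewed; a much larger n is
# needed before the sampling distribution looks roughly symmetric.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
reps = 500

results = {}
for n in (30, 10000):
    means = rng.lognormal(mean=0.0, sigma=1.5, size=(reps, n)).mean(axis=1)
    results[n] = skew(means)
    print(f"n = {n:5d}: skewness of sample means = {results[n]:.2f}")
```

The more skewed (or heavier-tailed) the population, the slower this convergence, which is why no single cutoff like 30 works for every distribution.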
 

Miner

TS Contributor
#8
The one bone I pick with the explanation is that errors and residuals are not the same thing, at least as I was taught in my stats courses and in the statistics books I have read. The error is a theoretical quantity that is unobservable, while the residual is an observable, sample-based estimate of the error of prediction. Again, that's how I was taught by a couple of statisticians and by different books, although it's somewhat of a smaller point. What are your thoughts on that?
I think the confusion arises because you have both residuals and errors. See the attached images. You have a table of residuals, which are the differences between the observed and predicted values. But you also have the ANOVA table that has an error term that is an aggregate of the residuals. The two are definitely related, but as you said they are different.
 

Karabiner

TS Contributor
#9
It's helpful to keep in mind that "small" and 30 are not exactly hard lines; much larger samples may be needed for data drawn from increasingly non-normal distributions. Thirty may be more than plenty for a slightly non-normal distribution, but 10,000 or more may be required for something from a multimodal, heavily skewed distribution.
This statement is a bit surprising. What do you mean by "is required" - required for what? The test statistics calculated are not (much) affected by non-normality of the residuals if n > 30.

With kind regards

Karabiner
 

Dason

Ambassador to the humans
#10
They're saying there is nothing magical about n=30 and depending on the characteristics of the population you may need larger sample sizes to get the distribution of the test statistic to be approximately normal.
 

Karabiner

TS Contributor
#11
No, nothing magical about 30. I have not yet seen any simulation which required more than n = 30 to 40 or so in order to deliver approximately normally distributed test statistics, even with markedly non-normal distributions of the residuals, e.g. uniform, extremely skewed, or bimodal. But anyway. My problem is the "10,000 or more" notion, which really is surprising (at least to me).

With kind regards

Karabiner
 

Dason

Ambassador to the humans
#12
Yeah, that might be a bit extreme... or it may not, depending on what context you're talking about. Extremely rare events modeled using logistic regression? We need very large sample sizes to get that to work well and for any inferences based on normal theory to be valid.
 
#13
Dason explained what I meant; however, the 10,000 number was more of an illustrative figure. Although, in the case of something drawn from a Cauchy distribution, no sample size is large enough to yield a roughly normal sampling distribution, as far as I am aware. It is a good example of CLT noncompliance.
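The Cauchy case can be demonstrated directly: the mean of n independent standard Cauchy draws is itself standard Cauchy for every n, so the spread of the sample mean never shrinks and the CLT never applies (the Cauchy has no finite mean or variance). A small sketch; the sample sizes and replication count are illustrative:

```python
# Sketch of CLT failure for the Cauchy: the interquartile range of the
# sample mean stays near 2 (the IQR of a standard Cauchy) regardless of
# how large n gets, instead of shrinking like 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(3)
reps = 1000

iqrs = {}
for n in (10, 100, 10000):
    means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    iqrs[n] = q75 - q25
    print(f"n = {n:5d}: IQR of sample means = {iqrs[n]:.2f}")
```

For any population with finite variance the printed IQR would drop by a factor of about 10 between n = 100 and n = 10,000; here it stays put.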
 

rogojel

TS Contributor
#15
Dason explained what I meant; however, the 10,000 number was more of an illustrative figure. Although, in the case of something drawn from a Cauchy distribution, no sample size is large enough to yield a roughly normal sampling distribution, as far as I am aware. It is a good example of CLT noncompliance.
Luckily, we do not encounter such variables often in real life - but it is a good reminder not to sample ratios of variables :)