When to assume Normality

#1
I am working with a rather large sample (approximately 120 observations). Typically I would have no trouble justifying normality with a sample this size, since the Central Limit Theorem would let me make that justification.

However, of the 120 or so observations, about 8 are extreme outliers and 3 are minor outliers. The skewness and kurtosis are also far larger than I would expect from a normal distribution.

Even with all of the outliers and high skewness/kurtosis levels, am I justified in assuming normality based solely on the large sample size?

Thank you, any insight would be greatly appreciated.
 

trinker

ggplot2orBust
#2
Normality is not an assumption about your sample. It's an assumption about the population. You should have normally distributed error terms (residuals). This can be checked visually with a QQ plot. Regression and ANOVA are generally fairly robust to the normality assumption, but checking the residuals is the way to be sure you're not violating it. It's pretty simple to do in most stats programs.
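The QQ-plot check described above can also be done programmatically. A minimal Python sketch with made-up data (the regression and all variable names here are hypothetical, just to illustrate the residual check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical example: fit a simple linear regression, then examine residuals
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)  # errors here really are normal
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# probplot returns the QQ-plot coordinates plus a straight-line fit;
# an r close to 1 means the QQ plot is nearly linear, consistent with normality
(osm, osr), (qq_slope, qq_intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 3))
```

Passing `plot=ax` (a matplotlib axis) to `probplot` draws the QQ plot itself; inspecting `r` is just a numeric stand-in for the visual check.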
 

Dason

Ambassador to the humans
#3
What is your goal? Some sort of hypothesis test? It makes a difference. Typically the assumption of normality isn't on the data itself but on the sampling distribution of some statistic. The CLT usually applies to this statistic but has little to do with the data itself.

One way to assess whether or not the assumption of normality on the test statistic still seems reasonable even with outliers is to bootstrap the sampling distribution using your observed sample. I don't know what software you use but I'm most familiar with R and it's very easy to do this in R.
 
#4
I need to decide whether or not to use a parametric paired t-test or some type of non-parametric signed-rank or rank test (hypothesis test).

Typically, when I work with smaller sample sizes, I would check such things as Shapiro-Wilk, outliers, skewness and kurtosis. Assuming that none of these raised any red flags (such as Shapiro-Wilk p-value<.05, extreme outliers....) I would assume normality and proceed with a parametric test.

However, when working with large sample sizes (such as those larger than 100), I have often been told that it is sufficient to simply look at the distribution of the sample and if it is roughly normal then it is fine to assume normality and use a parametric test.

This particular data set I'm working with now, however, is throwing me for a loop. The sample size is sufficiently large (about 120) however, it does not appear normal whatsoever - (the histogram is nowhere near close to being considered normal, the QQ-plot is not linear, the skewness/kurtosis values are very high).

Thus, I'm not sure whether I am justified in using a parametric test (in this case, a paired t-test)?
 

Dason

Ambassador to the humans
#5
In the case of a paired t-test the assumption is that the test statistic has a t-distribution. The tests you're doing check whether the differences themselves have a normal distribution. If the differences are normally distributed then the test statistic will have the appropriate distribution, but normality of the differences isn't strictly required.

Like I said, I would probably just bootstrap the sampling distribution and see if it looks approximately normal. If you've never done bootstrapping, it's a fairly nice concept and, depending on your software, very easy to do. So what software are you using?

Edit: I should note that there is nothing wrong with using non-parametrics and it's probably a better approach. But if you're interested in figuring out whether a parametric approach is still appropriate then bootstrapping will help you assess that.
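For reference, both the paired t-test and the signed-rank alternative are one-liners in most packages. A sketch in Python with fabricated paired data (all names and values here are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired measurements (simulated for illustration only)
before = rng.exponential(scale=10.0, size=120)
after = before + rng.normal(loc=1.0, scale=3.0, size=120)

# Parametric: paired t-test on the differences
t_stat, t_p = stats.ttest_rel(after, before)

# Non-parametric alternative: Wilcoxon signed-rank test on the same pairs
w_stat, w_p = stats.wilcoxon(after, before)

print(f"t-test p = {t_p:.4f}, signed-rank p = {w_p:.4f}")
```

Running both on the same pairs is a cheap sanity check: when they agree, the choice matters little; when they disagree, the parametric assumptions deserve a closer look.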
 
#6

I'm also not sure what you mean by bootstrapping?

Perhaps another way to rephrase my question: even with large sample sizes (greater than 100), how important is it that things such as skewness/kurtosis, outliers... are not out of whack?
 

Dason

Ambassador to the humans
#7
It depends. Bootstrapping is one way to assess whether the outliers and skewness make that much of a difference. My guess is that with the sample size you have you're probably fine - but it is still good to try to formally assess these things.

You could read up on bootstrapping a little bit - I think it's a good tool to have in anybody's statistical arsenal. But you've also failed to answer my question about what software you're using once again...
 

Dason

Ambassador to the humans
#9
I'm not very well versed in bootstrapping in SAS. The concept is fairly simple though:
1) Calculate your n differences.
2) Draw a random sample of size n, with replacement, from the differences.
3) Compute the mean of this sample and store it.
4) Repeat steps 2-3 k times (where k is typically 1000-10000).

The means you obtain in step 3 give you an approximation to the sampling distribution of the mean (which is what you're actually interested in). If this distribution looks approximately normal then you're in good shape.
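The four steps above can be sketched directly (shown here in Python rather than SAS; the paired data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical paired data (names and values made up for illustration)
before = rng.normal(50.0, 10.0, size=120)
after = before + rng.normal(2.0, 5.0, size=120)

# Step 1: the n paired differences
diffs = after - before
n = len(diffs)

# Steps 2-4: resample with replacement k times, storing each resample's mean
k = 10000
boot_means = np.array([rng.choice(diffs, size=n, replace=True).mean()
                       for _ in range(k)])

# boot_means approximates the sampling distribution of the mean difference;
# inspect its histogram (or skewness) to judge approximate normality
print(boot_means.mean(), boot_means.std())
```

A histogram of `boot_means` is the plot to look at; if it is roughly bell-shaped without a heavy tail, the t-based inference is on solid ground.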
 
#10
After I did the bootstrapping (I used 10,000 replications), I was left with a sampling distribution that looked very normal. HOWEVER, there was still a very long tail (accounting for about 5-10% of the data) which was heavily skewed. What is your opinion: would you be fine with this long tail and just assume normality?

Below is my best attempt to re-create the accompanying sampling distribution I obtained after 10,000 samples. Would you still assume normality or would the long tail give you cause for concern on perhaps just performing a non-parametric test?

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
***
*****
*******
*********
***********
********
******
***
**
*
*
*
 

Dason

Ambassador to the humans
#11
It is probably a small concern. You would probably be fine running a parametric test, but if you run a non-parametric test you don't even need to worry about it. Since we're still left with (slight) doubts about the assumptions, it would probably be safest to just go with the non-parametric test.