Normality assumption: data itself, or residuals?

#1
I have a question about the normality assumption in regression, ANOVA, and the t-test:

Does the normality assumption apply to both the original data and the residuals, or only to the residuals?

For example, here is some discussion.

I remember being taught that when the data are not normally distributed, I should use the Wilcoxon rank test rather than the t-test, and that for regression and ANOVA I should transform the data first (log, square root, square, etc.) to bring it closer to a normal distribution.
But now I read that what matters more is that the residuals are normally distributed.

My data is heavily positively skewed, and after log transformation it looks much better. I guess I will use the transformed data for regression and ANOVA, but right now I am really confused by this question.
 

Karabiner

TS Contributor
#2
The residuals of the model should (preferably) be sampled from a normally distributed population; the unconditional values of the dependent variable need not be. Moreover, if the sample is large enough, even non-normal residuals do not compromise the result of the statistical test.
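
To illustrate the difference between the dependent variable and the residuals, here is a minimal simulation sketch. Everything in it (the exponential predictor, the coefficients 2 and 3, the sample size, the helper names `skewness` and `ols_fit`) is an assumption made up for illustration, not something from this thread. The dependent variable comes out strongly skewed because the predictor is skewed, yet the residuals are roughly symmetric because the errors are normal.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - m) ** 3 for v in values]) / sd ** 3


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: a right-skewed predictor and normally distributed errors.
x = [rng.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)                   # skewness of the raw dependent variable
skew_residuals = skewness(residuals)   # skewness of the model residuals
print(f"skewness of y:         {skew_y:.2f}")
print(f"skewness of residuals: {skew_residuals:.2f}")
```

A histogram of `y` alone would suggest a normality problem here, while the residuals, which are what the assumption is about, would not.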

Wilcoxon is not a direct alternative to the t-test, since Wilcoxon (intended for dependent variables measured on an ordinal scale) doesn't test for mean differences.
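
A small sketch of why the two tests can disagree, under assumed data: paired differences drawn from a shifted exponential distribution, so the true mean difference is zero but the distribution is skewed. The helper `signed_rank_z` is a hypothetical hand-rolled normal approximation to the signed-rank statistic (no ties, no zero differences), not a library routine.

```python
import math
import random
import statistics


def signed_rank_z(diffs):
    """Normal-approximation z statistic for the Wilcoxon signed-rank test.

    Assumes no zero differences and no tied absolute values, which holds
    (with probability one) for continuous draws like the ones below.
    """
    n = len(diffs)
    # Rank the differences by absolute value, smallest gets rank 1.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean_w) / sd_w


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical paired differences: exponential minus 1, so the true mean is 0
# but most differences are negative (the true median is ln(2) - 1).
diffs = [rng.expovariate(1.0) - 1.0 for _ in range(n)]

mean_diff = statistics.fmean(diffs)
t_stat = mean_diff / (statistics.stdev(diffs) / math.sqrt(n))  # one-sample t statistic
z_wilcoxon = signed_rank_z(diffs)

print(f"mean difference: {mean_diff:.3f}")
print(f"t statistic:     {t_stat:.2f}")
print(f"signed-rank z:   {z_wilcoxon:.2f}")
```

The t statistic targets the mean difference, which is zero by construction, whereas the signed-rank statistic reacts to the asymmetry of the differences, so the two answer different questions.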

Transformation is sometimes a good idea if there are inherent reasons for it and results are interpretable (e.g. often income, or time-associated variables such as reaction speed etc. could reasonably be logarithmically transformed), but not just for achieving normality.
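
As a sketch of a case where a log transformation has an inherent reason, assume the outcome is generated multiplicatively (for example something income-like). The data-generating model, its coefficients, and the helper names below are all illustrative assumptions. On the log scale the slope has a direct reading as an approximate percentage change per unit of the predictor, and the residuals become roughly symmetric as a side effect.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - m) ** 3 for v in values]) / sd ** 3


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_skewness(x, y):
    """Skewness of the residuals from a straight-line fit of y on x."""
    intercept, slope = ols_fit(x, y)
    return skewness([yi - (intercept + slope * xi) for xi, yi in zip(x, y)])


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical multiplicative model: y = exp(1 + 0.3 * x + normal error).
x = [rng.uniform(0.0, 5.0) for _ in range(n)]
y = [math.exp(1.0 + 0.3 * xi + rng.gauss(0.0, 0.5)) for xi in x]
log_y = [math.log(yi) for yi in y]

_, log_slope = ols_fit(x, log_y)
skew_raw = residual_skewness(x, y)      # residuals when fitting y directly
skew_log = residual_skewness(x, log_y)  # residuals when fitting log(y)

# On the log scale, a one-unit increase in x multiplies y by exp(slope).
percent_change = (math.exp(log_slope) - 1.0) * 100.0
print(f"slope on log scale: {log_slope:.3f} (about {percent_change:.0f}% per unit of x)")
print(f"residual skewness, raw scale: {skew_raw:.2f}")
print(f"residual skewness, log scale: {skew_log:.2f}")
```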

With kind regards

K.
 
#3
Thank you very much, Karabiner. I understand that the Wilcoxon signed-rank test is the nonparametric test equivalent to the dependent t-test.
I think you are right that Wilcoxon does not test the mean difference between two groups, but people often say it is an alternative to the t-test, which causes some confusion.
Wikipedia says:
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student's t-test, t-test for matched pairs, or the t-test for dependent samples when the population cannot be assumed to be normally distributed.
 

CowboyBear

Super Moderator
#4
"I have a question about the normality assumption in regression, ANOVA, and the t-test: does the normality assumption apply to both the original data and the residuals, or only to the residuals?"

The normality assumption applies to the error terms. (Errors are a slightly different concept from residuals, though we examine the residuals to get an idea of what might be happening with the errors). The assumption is definitely not about the data itself.

Note that the normal-errors assumption is not required for the ordinary least squares estimator to be unbiased, consistent, and efficient (in the sense of being the best linear unbiased estimator, BLUE). Neither is it required for the asymptotic distribution of the sample coefficients to be normal, i.e., with large-ish samples, it's irrelevant. It is only required for an assurance that the sampling distribution of the coefficients will be normal when using small samples (and thus that significance tests and confidence intervals will be trustworthy, even if the sample is small).
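
The large-sample claim can be checked with a rough simulation sketch. All settings here are assumptions chosen for illustration: a uniform predictor, strongly skewed errors (exponential minus 1), a sample size of 200, 2000 replications, and a normal critical value in place of the t quantile. The sketch records the average estimated slope and how often a nominal 95% confidence interval contains the true slope.

```python
import math
import random
import statistics


def slope_and_se(x, y):
    """Least-squares slope of y on x and its usual standard error."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(sse / (n - 2) / sxx)


rng = random.Random(0)  # fixed seed so the run is repeatable
true_slope, n, reps = 0.5, 200, 2000

# Normal critical value (about 1.96); an approximation to the t quantile at this n.
z_crit = statistics.NormalDist().inv_cdf(0.975)

slopes = []
covered = 0
for _ in range(reps):
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Errors are exponential minus 1: mean zero but far from normal.
    y = [1.0 + true_slope * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
    slope, se = slope_and_se(x, y)
    slopes.append(slope)
    covered += abs(slope - true_slope) <= z_crit * se

mean_slope = statistics.fmean(slopes)  # should sit close to the true slope
coverage = covered / reps              # share of intervals containing the true slope
print(f"mean estimated slope: {mean_slope:.3f}")
print(f"95% CI coverage:      {coverage:.3f}")
```

Rerunning with a much smaller `n` is one way to see where the normal-errors assumption starts to matter.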

This is a very common question on this forum, and we have written an article about it that you can read here: http://pareonline.net/getvn.asp?v=18&n=11