
Thread: normality assumption: data itself, or residuals ?

  1. #1

    normality assumption: data itself, or residuals ?




    I have a question about the normality assumption in regression, ANOVA, and the t-test:

    Does the normality assumption apply to the original data and the residuals, or only to the residuals?

    For example, here is some discussion.

    I remember being taught that when the data are not normally distributed, I should use the Wilcoxon rank test rather than the t-test, and that for regression and ANOVA I should transform the data first (log, square root, square, etc.) to get closer to a normal distribution.
    But now I read that what matters more is that the residuals are normally distributed.

    My data are heavily positively skewed, and after a log transformation they look much better. I guess I will use the transformed data for regression and ANOVA, but right now I am really confused by this question.

  2. #2
    TS Contributor
    Karabiner

    Re: normality assumption: data itself, or residuals ?

    The residuals of the model should (preferably) be sampled from a normally distributed population, not the unconditional values of the dependent variable. Moreover, if the sample is large enough, even non-normal residuals do not compromise the result of the statistical test.
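
    A minimal sketch of this distinction, assuming simulated data (the skewed predictor, the coefficients 2 and 3, and the sample size are all made up for illustration). The dependent variable is clearly skewed, yet the residuals of the fitted line are close to symmetric because the errors were drawn from a normal distribution:

```python
import random

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
x = [rng.expovariate(1.0) for _ in range(n)]             # skewed predictor (assumed)
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]   # normally distributed errors

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"skewness of y:         {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

    Checking the raw dependent variable for normality would flag a problem here that the model does not actually have.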

    The Wilcoxon test is not a direct alternative to a t-test, since it (being intended for dependent variables measured on an ordinal scale) does not test for mean differences.
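
    A rough sketch of why the two tests answer different questions, assuming simulated paired differences with a true mean of exactly zero but a skewed shape. The signed-rank statistic is computed by hand with the usual normal approximation (no ties, no zero differences), so this is an illustration rather than a replacement for a statistics package:

```python
import math
import random

def signed_rank_z(diffs):
    """Wilcoxon signed-rank statistic as a z-score (normal approximation).

    Assumes no zero differences and no ties, which holds for continuous data.
    """
    n = len(diffs)
    # Rank the differences by absolute value, smallest first.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Sum the ranks that belong to positive differences.
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean_w) / sd_w

rng = random.Random(1)
n = 2000
# Paired differences with true mean 0 but a skewed shape (assumed for illustration).
diffs = [rng.expovariate(1.0) - 1.0 for _ in range(n)]

mean_diff = sum(diffs) / n
z = signed_rank_z(diffs)
print(f"sample mean of differences: {mean_diff:.3f}")
print(f"signed-rank z-score:        {z:.2f}")
```

    The sample mean stays near zero while the signed-rank z-score is far from zero: the test is reacting to the asymmetry of the differences, not to a difference in means.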

    Transformation is sometimes a good idea if there are inherent reasons for it and results are interpretable (e.g. often income, or time-associated variables such as reaction speed etc. could reasonably be logarithmically transformed), but not just for achieving normality.
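
    As a small sketch of what "interpretable" can mean for a log transformation, assume hypothetical incomes in two groups that differ by a fixed amount on the log scale (all numbers below are invented). The slope of a regression of log(income) on a 0/1 group indicator then back-transforms to a ratio between the groups:

```python
import math
import random

rng = random.Random(2)
n_per_group = 1000
true_log_ratio = 0.5  # assumed effect on the log scale

# Hypothetical incomes: log-normal in both groups, shifted on the log scale.
group_a = [math.exp(10.0 + rng.gauss(0.0, 0.5)) for _ in range(n_per_group)]
group_b = [math.exp(10.0 + true_log_ratio + rng.gauss(0.0, 0.5))
           for _ in range(n_per_group)]

def mean(values):
    return sum(values) / len(values)

# With a single 0/1 predictor, the OLS slope on log(income) equals the
# difference between the two group means of log(income).
slope_log = mean([math.log(v) for v in group_b]) - mean([math.log(v) for v in group_a])
ratio = math.exp(slope_log)  # ratio of geometric means, group B relative to A

print(f"slope on log scale: {slope_log:.3f}")
print(f"implied ratio:      {ratio:.2f}")
```

    The result reads as "group B incomes are typically about this many times group A incomes", which is a statement about geometric means rather than arithmetic means.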

    With kind regards

    K.

  3. #3

    Re: normality assumption: data itself, or residuals ?

    Thank you very much, Karabiner. I understand that the Wilcoxon signed-rank test is the nonparametric counterpart of the dependent (paired) t-test.
    I think you are right that the Wilcoxon test does not test the mean difference between two groups, but people often say it is an alternative to the t-test. This causes some confusion.
    Wikipedia says:
    The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student's t-test, t-test for matched pairs, or the t-test for dependent samples when the population cannot be assumed to be normally distributed.

  4. #4
    TS Contributor
    CowboyBear

    Re: normality assumption: data itself, or residuals ?


    Originally posted by fengyuwuzu:
    I have a question about the normality assumption in regression, ANOVA, and the t-test:

    Does the normality assumption apply to the original data and the residuals, or only to the residuals?

    The normality assumption applies to the error terms. (Errors are a slightly different concept from residuals, though we examine the residuals to get an idea of what might be happening with the errors.) The assumption is definitely not about the data itself.
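
    A minimal sketch of the errors-versus-residuals distinction, assuming a simulation in which the true line is known (in real data it never is, so the errors cannot be observed). Errors are deviations from the true line; residuals are deviations from the fitted line, and with an intercept in the model they sum to zero by construction:

```python
import random

rng = random.Random(3)
n = 50
true_a, true_b = 1.0, 2.0  # assumed "true" line, known only because we simulate

x = [rng.uniform(0.0, 10.0) for _ in range(n)]
errors = [rng.gauss(0.0, 1.0) for _ in range(n)]   # unobservable in practice
y = [true_a + true_b * xi + e for xi, e in zip(x, errors)]

# Ordinary least squares fit of y = a + b*x.
mx, my = sum(x) / n, sum(y) / n
b_hat = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
a_hat = my - b_hat * mx

# Residuals are what we can actually compute from the data.
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

sum_errors = sum(errors)
sum_residuals = sum(residuals)
print(f"sum of errors:    {sum_errors:.3f}")
print(f"sum of residuals: {sum_residuals:.3e}")
```

    The residuals are therefore an estimate of the errors, constrained by the fitting procedure, which is why they serve as a diagnostic rather than as the thing the assumption is about.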

    Note that the normal-errors assumption is not required for the ordinary least squares estimator to be unbiased, consistent, and efficient (in the sense of being BLUE, the best linear unbiased estimator). Neither is it required for the asymptotic distribution of the sample coefficients to be normal; i.e., with large-ish samples, it's irrelevant. It is only required to ensure that the sampling distribution of the coefficients is normal in small samples (and thus that significance tests and confidence intervals are trustworthy even when the sample is small).
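
    A rough simulation of the large-sample point, under simplifying assumptions: the model is intercept-only (so the OLS coefficient is just the sample mean), the errors are a shifted exponential with mean zero, and the sample sizes 5 and 200 are arbitrary choices. The skewness of the estimated coefficient across repeated samples shows how far its sampling distribution is from a symmetric, normal-like shape:

```python
import random

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulated_intercepts(n, reps, rng):
    """OLS intercept of an intercept-only model (the sample mean), over many samples."""
    estimates = []
    for _ in range(reps):
        # True intercept 5.0 plus skewed errors with mean 0 (assumed distribution).
        sample = [5.0 + rng.expovariate(1.0) - 1.0 for _ in range(n)]
        estimates.append(sum(sample) / n)
    return estimates

rng = random.Random(4)
reps = 4000
skew_small = skewness(simulated_intercepts(5, reps, rng))
skew_large = skewness(simulated_intercepts(200, reps, rng))
print(f"skewness of estimates, n=5:   {skew_small:.2f}")
print(f"skewness of estimates, n=200: {skew_large:.2f}")
```

    With five observations the estimates inherit much of the skew in the errors; with two hundred the skew has mostly washed out, even though the errors themselves are no more normal than before.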

    This is a very common question on this forum, and we have written an article about it that you can read here: http://pareonline.net/getvn.asp?v=18&n=11
