Thread: Correlation (do the data need to be normally distributed?)

  1. #1
    Hi,

    Today I read through the chapter on correlation, where it is stated:

    The condition for applying either regression or correlation is that both variables should be at least approximately normally distributed.

    I think this statement is true, but surely there must always be a correlation when both variables have the same distribution, whether that is normal or something else, right?

    Cheers

  2. #2
    terzi (TS Contributor)

    Hi beginner,

    Your book is partially lying.

    Pearson's correlation coefficient may indeed be sensitive to non-normality in the variables, so it shouldn't be used when you have serious doubts about this distributional assumption. Still, there are other correlation measures that don't rely on that assumption, such as Spearman's rank correlation or Kendall's correlation. These are non-parametric alternatives that don't require normality. So you can say that Pearson's correlation is tied to normal distributions, but correlation in general is not.
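To make the difference between the three measures concrete, here is a minimal sketch in R. The data are made up for illustration (an assumption, not from the book or this thread): `x` is skewed and `y` is a perfectly monotone but non-linear function of `x`. The rank-based measures only look at the ordering, so they should report a perfect association, while Pearson's coefficient measures linear association and comes out lower.

```r
# Hypothetical data: skewed x, and y a monotone non-linear function of x
set.seed(1)
x <- rexp(200)   # exponential, clearly non-normal
y <- exp(x)      # strictly increasing in x, but far from linear

# The same base R function computes all three coefficients
r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

Spearman and Kendall come out as 1 here because the ranks of `x` and `y` agree exactly; Pearson is below 1 because the relationship is not a straight line.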

    And regarding regression analysis, your book is totally lying, since there are no assumptions in regression about the distribution of the response variable or of the regressors. The only distributional assumptions in a regression model concern the residuals.

    Hope that helps
    Statisticians are engaged in an exhausting but exhilarating struggle with the biggest challenge that philosophy makes to science: how do we translate information into knowledge

  3. #3
    Hi terzi,

    thanks for your reply.
    While waiting for an answer, I read more about regression and correlation and I agree with your statements.

    Thanks for your support!
    Beginner

  4. #4
    CowboyBear (TS Contributor)

    Hi guys,

    Beginner - as I understand it, your first thoughts were actually correct, at least in a sense. Pearson's correlation requires only *similar* underlying distributions to accurately assess the magnitude of a correlation. However, similar *normal* distributions are required for significance tests, confidence intervals, etc, limiting the usefulness of the coefficient in the absence of normality:

    http://faculty.chass.ncsu.edu/garson...rel.htm#assume
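As a rough illustration of the point about significance tests and confidence intervals, here is a small sketch using base R's `cor.test`. The data are invented for the example (skewed exponential variables, an assumption). The Pearson version returns a t-test p-value and a confidence interval based on Fisher's Z transform, both of which lean on normality; the Spearman version is rank-based and returns no confidence interval at all.

```r
# Hypothetical skewed data, for illustration only
set.seed(42)
n <- 50
x <- rexp(n)
y <- x + rexp(n)

# Pearson: coefficient plus a normal-theory test and confidence interval
pearson_test <- cor.test(x, y, method = "pearson")
# Spearman: rank-based test that does not assume normality
spearman_test <- cor.test(x, y, method = "spearman")

print(pearson_test$estimate)
print(pearson_test$conf.int)   # trustworthy only under (approximate) bivariate normality
print(spearman_test$estimate)
print(spearman_test$p.value)
```

The coefficient itself is computed the same way either way; it is the p-value and the interval around it that depend on the distributional assumption.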

    Regarding regression: Terzi, you're technically quite right in saying that the assumptions of regression concern the residuals and not the variables per se. But it should probably be stressed that it's going to be pretty hard for residuals to be normally distributed when the responses and/or IVs are not. So in a basic text, it might make sense to refer to univariate normality of the studied variables (an easier concept to grasp, and a prerequisite for multivariate/residual normality, which can then be discussed later). So I wouldn't say the book is totally off!

  5. #5
    Hi CowboyBear,

    thanks for your reply as well!

    I thought about what you said again and it definitely makes sense.
    One can indeed calculate the correlation coefficient for non-normal distributions,
    but as you said, one cannot really apply statistical tests afterwards.
    So it might be better to have normally distributed data.

    Best,
    Beginner

  6. #6
    TheEcologist (R purist)

    Quote, originally posted by CowboyBear:
    "... but it probably should be stressed that it's going to be pretty hard for residuals to be normally distributed when the responses and/or IV's are not"

    The book is lying, or at least being far from complete in its statements. I agree with terzi.

    Regression requires the residuals to be normally distributed. It is very easy for the residuals to be normally distributed when the DV and IV are not. We actually expect the IV and DV not to be normally distributed in this case; that's why you so often hear people say 'we corrected for the linear trend by using a regression', etc. Here's why:

    [Very roughly explained because I'm not going into details]

    Let's say there is a strong linear relationship between Y and X, with some error of course.

    X shouldn't be normal, since logically you controlled for it. The textbook example is a linearly increasing variable, e.g. ten measurements of Y at each interval (of e.g. time or temperature), so if anything X will be uniform.

    Y is related to X as Y ~ b + aX + error. When X is uniform and the relationship is strong, b + aX is uniform as well. Your linear model will pick up on the first part, b + aX (hence people say they controlled for it); what remains is the error, and that's the only part that should be normal.

    Does this make sense?

    Here is the same story as an R simulation:

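    Code:
    # fix the random number generator so the run is reproducible
    set.seed(1000)
    # our controlled independent variable
    X <- 1:100
    # our linearly related variable: intercept b = 10, slope a = 2, standard normal error
    Y <- (10 + 2 * X) + rnorm(100)
    # fit the linear model once and reuse it
    fit <- lm(Y ~ X)
    # are X and Y normal???
    par(mfrow = c(3, 1))
    hist(X, main = 'X')
    hist(Y, main = 'Y') # oh no, certainly not!
    # what about the error?
    hist(residuals(fit), main = 'Residuals')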
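
If you prefer a number to eyeballing histograms, a possible follow-up (a sketch, not part of the simulation above) is to run a Shapiro-Wilk normality test on X, Y and the residuals. It regenerates the same simulated data so it can be run on its own; using `shapiro.test` here is my assumption about a reasonable check, and any normality diagnostic would do.

```r
# Same simulated data as in the histogram example
set.seed(1000)
X <- 1:100
Y <- (10 + 2 * X) + rnorm(100)
fit <- lm(Y ~ X)

# Shapiro-Wilk p-values: a small value is evidence against normality
p_values <- c(
  X         = shapiro.test(X)$p.value,
  Y         = shapiro.test(Y)$p.value,
  residuals = shapiro.test(residuals(fit))$p.value
)
print(round(p_values, 4))
```

X and Y are evenly spread over their range, so the test rejects normality for both; the residuals are what the regression assumption is actually about, so theirs is the p-value to look at.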
    The true ideals of great philosophies always seem to get lost somewhere along the road..

  7. #7
    TheEcologist (R purist)

    Quote, originally posted by beginner:
    "... but there must always be a correlation when both variables have the same distribution, doesn't matter if it is normal distributed or something else, right?"

    Two variables can have exactly the same distribution yet have no correlation whatsoever. The key component here is 'pairs of data': for instance, high values of one variable go together with high values of the other. The variables are thus linked.
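
A quick way to convince yourself of this is the following sketch (the data are simulated, purely as an illustration). `y_shuffled` contains exactly the same values as `x`, so the two have identical distributions, but the pairing is random. Re-pairing the same values in sorted order then gives a perfect correlation, which shows that it is the pairing that matters and not the shape of the distribution.

```r
# Hypothetical example: identical distributions, different pairings
set.seed(123)
x <- rexp(1000)            # any distribution would do; this one is skewed
y_shuffled <- sample(x)    # same values as x, randomly re-paired

# Same distribution, but no link between the pairs
r_shuffled <- cor(x, y_shuffled)

# Same values again, now paired smallest-with-smallest
r_sorted <- cor(sort(x), sort(y_shuffled))

print(c(shuffled = r_shuffled, sorted = r_sorted))
```

The shuffled pairing gives a coefficient close to 0 and the sorted pairing gives exactly 1, even though the marginal distributions never changed.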

    Pearson's correlation coefficient assumes that the two variables are normally distributed, and you can see this in its equation: it uses 1) mean values and 2) standard deviations. These measures are not 'robust statistics', which means that they are strongly influenced by non-normality:

    (quotes from an earlier post)

    "Medians are robust measures of central tendency (being highly inelastic) in contrast to the mean; the median has a breakdown point of 50%, while the mean has a breakdown point of 0% (a single observation can already influence it)"

    "Statistics such as the median absolute deviation or interquartile range are robust measures of statistical dispersion, while the standard deviation and the range are not; the latter two are highly elastic and strongly influenced by outliers."

    So it's not good scientific practice to use these measures when your data are non-normal. That's why you can't use Pearson's coefficient when your data are not normal: you won't get an accurate answer. So here it is actually very logical that both variables should be normal.
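
Here is a small sketch of how fragile these measures can be, using made-up numbers (an assumption for illustration). The helper `pearson_by_hand` is a hypothetical name; it just writes out Pearson's coefficient in terms of means and standard deviations so you can see where the non-robust pieces enter. One wild observation is then added to otherwise perfectly correlated data.

```r
# Pearson's r written out with means and standard deviations
# (pearson_by_hand is a hypothetical helper, equivalent to cor())
pearson_by_hand <- function(x, y) {
  n <- length(x)
  sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
}

# Perfectly correlated toy data, then one extreme outlier added
x <- 1:20
y <- 1:20
x_out <- c(x, 21)
y_out <- c(y, -100)

# The mean moves a lot, the median barely moves
print(c(mean = mean(y), median = median(y)))
print(c(mean = mean(y_out), median = median(y_out)))

# Pearson collapses; the rank-based Spearman coefficient is hurt much less
r_hand     <- pearson_by_hand(x_out, y_out)
r_pearson  <- cor(x_out, y_out)
r_spearman <- cor(x_out, y_out, method = "spearman")
print(c(by_hand = r_hand, pearson = r_pearson, spearman = r_spearman))
```

With the single outlier, Pearson's coefficient even turns slightly negative, while Spearman's stays clearly positive. The rank-based measure is more robust here, though it is not completely unaffected either.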

    So the 'rules':

    1) with Pearson's correlation, both variables should be normally distributed
    2) with regression, only the residuals or "error" must be normal; don't worry about the variables themselves (remember, here you actually controlled for one variable)

    Therefore another thing to keep in mind:
    Strictly speaking, you should use correlation when you randomly sampled the data without controlling for a variable (e.g. both measurements are randomly sampled, like samples of water in which you measure pH, plankton, etc., but you didn't control for pH).

  8. #8
    CowboyBear (TS Contributor)

    Thanks for clarifying that TE, that does make sense!
