Thread: Correlation (do the data need to be normally distributed?)

1. Correlation (do the data need to be normally distributed?)

Hi,

Today I read through the chapter 'Correlation', where it is stated:

the condition to apply either regression or correlation is that both variables should be at least approx. normally distributed.

I think this statement is true, but there must always be a correlation when both variables have the same distribution, no matter whether it is normally distributed or something else, right?

Cheers

2. Hi beginner,

Pearson's correlation coefficient may indeed be sensitive to non-normality in the variables, so it shouldn't be used when you have serious doubts about this distributional assumption. Still, there are other correlation measures that don't rely on that assumption, such as Spearman's rank correlation or Kendall's tau. These are non-parametric alternatives that don't require normality. So you can say that Pearson's correlation is conditional on normal distributions, but not every correlation measure is.
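
For instance, in R these are just a `method` argument away. A rough sketch with simulated, deliberately non-normal data:

```r
# simulate a pair of related but clearly non-normal variables
set.seed(42)
x <- rexp(200)            # exponential: skewed, non-normal
y <- x + rexp(200)        # related to x, also non-normal

cor(x, y, method = "pearson")   # parametric, sensitive to non-normality
cor(x, y, method = "spearman")  # rank-based, no normality assumption
cor(x, y, method = "kendall")   # also rank-based
```

Spearman and Kendall only look at the ranks, so they also handle monotone but non-linear relationships gracefully.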

And regarding regression analysis, your book is totally lying, since there are no assumptions in regression regarding the response variable or the regressors. The only assumptions in a regression model concern the residuals.

Hope that helps

3. Hi terzi,

Beginner

4. Hi guys,

Beginner - as I understand it, your first thoughts were actually correct, at least in a sense. Pearson's correlation requires only *similar* underlying distributions to accurately assess the magnitude of a correlation. However, similar *normal* distributions are required for significance tests, confidence intervals, etc., limiting the usefulness of the coefficient in the absence of normality:

http://faculty.chass.ncsu.edu/garson...rel.htm#assume

Regarding regression - Terzi, you're technically quite right that the assumptions of regression are with regard to residuals and not variables per se, but it probably should be stressed that it's going to be pretty hard for residuals to be normally distributed when the responses and/or IV's are not. So in a basic text, it might make sense to refer to univariate normality of the studied variables (being an easier concept to grasp and a prerequisite for multivariate/residual normality, which can then be discussed later). So I wouldn't say the book is totally off!
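
To make that distinction concrete in R: `cor()` alone gives you the descriptive coefficient, while `cor.test()` adds the significance test and confidence interval, and it's those latter pieces that lean on the normality assumption. A rough sketch:

```r
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

cor(x, y)             # the coefficient itself: a descriptive measure

ct <- cor.test(x, y)  # adds the t-test and CI, which assume normality
ct$estimate           # same coefficient as cor(x, y)
ct$p.value            # significance test
ct$conf.int           # confidence interval (via Fisher's z)
```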

5. Hi Cowboybear

I thought about what you said again and it definitely makes sense.
One can indeed calculate the correlation coefficient for non-normal distributions,
but as you said, one cannot really apply statistical tests afterwards.
So it might be better to have normally distributed data.

Best,
Beginner

6. Originally Posted by CowboyBear
Hi guys,
but it probably should be stressed that it's going to be pretty hard for residuals to be normally distributed when the responses and/or IV's are not
The book is lying, or at least being far from complete in its statements. I agree with terzi.

Regression requires the residuals to be normally distributed. It is very easy for the residuals to be normally distributed when the DV and IV are not. We actually expect the IV and DV not to be normally distributed in this case; that's why you so often hear people saying 'we corrected for the linear trend by using a regression', etc. Here's why:

[Very roughly explained because I'm not going into details]

Let's say there is a strong linear relationship between Y and X, with some error of course.

X shouldn't be normal, as logically you controlled for it. The textbook example is a linearly increasing variable, e.g. ten measurements of Y at each interval (of e.g. time or temperature); thus, if anything, it will be uniform.

Y is the variable related to X as Y ~ b + aX + error. When X is uniform and the relationship is strong, then b + aX is uniform as well. Your linear model will pick up on the first part, b + aX (hence people say they controlled for it); what remains is the error, and that's the only part that should be normal.

Is this making sense??

Here's the same story as an R simulation:

Code:
``````
# set the random number generator seed
set.seed(1000)
# our controlled independent variable
X <- 1:100
# our linearly related variable, b=10, a=2
Y <- (10 + 2*X) + rnorm(100)
# are they normal???
par(mfrow=c(3,1))
hist(X, main='X'); hist(Y, main='Y') # oh no, certainly not!
hist(lm(Y~X)$residuals) # but the residuals are
``````

7. Originally Posted by beginner
Hi,

but there must always be a correlation when both variables have the same distribution, no matter whether it is normally distributed or something else, right?

Cheers
Two variables can have exactly the same distribution yet have no correlation whatsoever. The key component here is the 'pairs of data': for instance, high values of one variable are related to high values of the other variable. The variables are thus linked.
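
A quick R sketch of that point: two variables drawn independently from the identical distribution show essentially zero correlation, because the pairs aren't linked:

```r
set.seed(123)
a <- rnorm(1000)   # standard normal
b <- rnorm(1000)   # identical distribution, drawn independently
cor(a, b)          # near 0: same distribution, but the pairs aren't linked

# now link the pairs: high values of a go with high values of b2
b2 <- a + rnorm(1000, sd = 0.5)
cor(a, b2)         # strongly positive
```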

Pearson's correlation coefficient assumes that the two variables are normally distributed, and you can see this in its equation (check it here). Here we see that it uses 1) mean values and 2) standard deviations. These measures are not 'robust statistics', which means they are strongly influenced by non-normality:

(quotes from an earlier post)

"Medians are robust measures of central tendency (being highly inelastic), in contrast to the mean; the median has a breakdown point of 50%, while the mean has a breakdown point of 0% (just one sample can already influence it)"

"Statistics as the median absolute deviation or interquartile range are robust measures of statistical dispersion, while the standard deviation and the range are not. With the latter two being highly elastic and strongly influenced by outliers."

So it's not good scientific practice to use these measures when your data are non-normal. That's why you can't use Pearson's coefficient when your data are not normal: you won't get an accurate answer. So here it's actually very logical why both variables should be normal.
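
To see that breakdown in action, here's a rough illustration in R: one wild pair of values drags the mean, the standard deviation, and with them Pearson's r, while the median, the MAD, and Spearman's rank correlation barely move:

```r
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.3)

# contaminate the sample with a single outlying pair
xo <- c(x, 10)
yo <- c(y, -10)

c(mean(x), mean(xo))      # the mean jumps
c(median(x), median(xo))  # the median barely moves
c(sd(x), sd(xo))          # the sd inflates
c(mad(x), mad(xo))        # the MAD stays put

cor(x, y)                         # strong positive correlation
cor(xo, yo)                       # wrecked by a single point
cor(xo, yo, method = "spearman")  # the rank-based version survives
```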

So the 'rules':

1) with Pearson's correlation, both variables should be normally distributed
2) with regression, only the residuals or "error" must be normal; don't worry about the variables themselves (remember, here you actually controlled for one variable)

Therefore another thing to keep in mind:
Strictly speaking, you should use correlation when you randomly sampled the data without controlling for a variable (e.g. both measurements are randomly sampled, like samples of water in which you measure pH and plankton etc., but didn't control for pH).

8. Thanks for clarifying that TE, that does make sense!
