# Pearson's versus Spearman's correlation

#### jle

##### New Member
I am trying to determine which correlation coefficient is more appropriate.

The two variables to be correlated are visual acuity (which can be negative and positive) at baseline and visual acuity after one year.

Pearson's correlation is appropriate if:

1. The data is interval or ratio data

2. There is a linear relationship between the two

3. No significant outliers

4. Variables should be normally distributed

I know that assumptions 1 and 3 are not violated.

However, to assess linearity I made a scatter-plot which looks unusual (attached).

I know that data for both variables are normally distributed but I also came across a term called bi-variate normal distribution. Can someone explain this term in a non-technical way?

Does the shift in visual acuity over time mean that the fourth assumption is violated? This is somewhat confusing to me.

I am beginning to think Spearman's correlation is more appropriate but I would like to fully justify this decision.
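One way to see how the two coefficients differ in practice is to compute both on the same data. The sketch below uses simulated data, not the data set from the attachment; the variable names `baseline` and `followup` and the way the data are generated are assumptions made purely for illustration.

```r
set.seed(1)

# Hypothetical stand-ins for visual acuity at baseline and after one year
n <- 500
baseline <- rnorm(n)
followup <- 0.6 * baseline + rnorm(n, sd = 0.8)

# Pearson's r measures linear association
r_pearson <- cor(baseline, followup, method = "pearson")

# Spearman's rho is Pearson's r computed on the ranks,
# so it measures monotonic association
r_spearman <- cor(baseline, followup, method = "spearman")
r_ranks <- cor(rank(baseline), rank(followup))
all.equal(r_spearman, r_ranks)

# A monotonic but non-linear transformation changes Pearson's r
# and leaves Spearman's rho unchanged
cor(baseline, exp(followup), method = "pearson")
cor(baseline, exp(followup), method = "spearman")
```

If the scatter-plot suggests a monotonic but clearly non-linear pattern, the two coefficients will tend to diverge, and that divergence can be part of the justification for choosing one over the other.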

#### CowboyBear

##### Super Moderator
Bivariate normal distribution means:

(Marginal) distribution of first variable is normal
(Marginal) distribution of second variable is normal
Given any value of the one variable, the (conditional) distribution of the other variable is normal

You can't really tell much about normality from a scatter plot. Try a q-q plot of each variable maybe, and if you're wanting to be really careful a q-q plot of the residuals of Y when X is a predictor (and vice versa).
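The checks suggested above could be sketched in base R roughly as follows. The data here are simulated, and `x` and `y` are placeholder names standing in for the two acuity variables.

```r
set.seed(2)

# Hypothetical data; replace x and y with the two variables of interest
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

op <- par(mfrow = c(2, 2))

# Marginal distributions: one q-q plot per variable
qqnorm(x, main = "Q-Q plot of x"); qqline(x)
qqnorm(y, main = "Q-Q plot of y"); qqline(y)

# Conditional distributions: q-q plots of the regression residuals,
# once with x as predictor and once with y as predictor
res_y <- resid(lm(y ~ x))
res_x <- resid(lm(x ~ y))
qqnorm(res_y, main = "Residuals of y ~ x"); qqline(res_y)
qqnorm(res_x, main = "Residuals of x ~ y"); qqline(res_x)

par(op)
```

Points that follow the reference line closely are consistent with normality; systematic curvature at the ends points to heavy or light tails.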

#### Karabiner

##### TS Contributor
What has always bothered me: Pearson's r is exactly the same as the standardized beta in a simple regression, and the p-value for testing r or beta, respectively, is exactly the same; but in the case of correlation bivariate normality is assumed, while in regression only the residuals should be normally distributed (in the population), and even this can be more or less neglected if the sample size is large. Would this mean that one can simply neglect the normality assumptions for the testing of Pearson's r if the sample size is large?

With kind regards

K.
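The equivalence described in this post can be checked numerically. This is a minimal sketch on simulated data (the sample size and effect size are arbitrary assumptions): the standardized slope from `lm` matches Pearson's r, and the two p-values coincide.

```r
set.seed(3)

# Arbitrary simulated data for illustration
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)

fit <- lm(y ~ x)
ct <- cor.test(x, y)

# The slope rescaled by sd(x) / sd(y) equals Pearson's r
beta_std <- coef(fit)[["x"]] * sd(x) / sd(y)
all.equal(beta_std, unname(ct$estimate))

# The p-value for the slope equals the p-value for r
p_reg <- summary(fit)$coefficients["x", "Pr(>|t|)"]
all.equal(p_reg, ct$p.value)
```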

#### CowboyBear

##### Super Moderator
Yeah, I've never quite got my head around that one either; it doesn't make sense. Anybody got any ideas?

#### jle

##### New Member

It's kind of strange because my Q-Q plots look unusual too, unless I'm doing it wrong.

The sample I have is about 100,000 cases.

In this case, should the condition of normality be ignored?

Is it justified to do so?

#### Karabiner

##### TS Contributor
At least the standard error of the estimate is so extremely tiny with 100,000 cases that it sounds strange to assume that the shape of the distributions could still affect the inference from the sample data to the population (i.e. the significance statement). And if this were linear regression, one would be absolutely sure that the results of the significance test are reliable.

#### spunky

##### Smelly poop man with doo doo pants.
> Karabiner: This would mean, simply neglect the normality assumptions for the testing of Pearson's r, if sample size is large?

> CowboyBear: Yeah, I've never quite got my head around that one either; it doesn't make sense. Anybody got any ideas?

well... it seems pretty reasonable to me as for why the behaviour of the test statistic for the case of correlation and simple bivariate regression SHOULD be the same... but the important thing here is that we don't need to conjecture anything about it. we can simulate it.

first, i'd like to direct people's attention HERE for why i am just simulating data with a non-zero kurtosis. kurtosis is intrinsically related with the variability of anything variance related... so if you're after the standard errors of covariances, variances or correlations, kurtosis is going to show up one way or another.

the first half of the article is a very good and useful explanation of how kurtosis distorts the standard errors of variances/covariances/correlations. the second half deals with more Structural Equation Modelling (SEM) stuff so i wouldn't recommend it as much unless you're interested.
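For the simplest case, the link between kurtosis and the variability of a variance estimate can be written down directly. This is the standard textbook result for the sample variance $S^2$ of $n$ independent draws from a distribution with variance $\sigma^2$ and finite fourth moment, where $\gamma_2$ denotes the excess kurtosis (0 for a normal distribution); it is added here as an illustration and is not taken from the linked article:

$$
\operatorname{Var}(S^2) = \sigma^4 \left( \frac{2}{n-1} + \frac{\gamma_2}{n} \right)
$$

With $\gamma_2 = 0$ this reduces to the normal-theory value $2\sigma^4/(n-1)$; positive excess kurtosis makes the sample variance more variable than normal theory assumes, and the extra term shrinks as $n$ grows.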

anyhoo... the simulation.

first, i ran a "baseline" simulation where the data was distributed bivariate normal and got what you would expect: the empirical rejection rates were very, very close to .05 (and equal in both cases) because it's essentially the same test.

let's see what happens when you have a smallish sample size of 30:

Code:
library(lavaan) # need this to generate the data

# correlation is 0 in the population so any rejection rate over .05 is an inflated Type I error rate
mod1 <- 'x ~~ 0.0*y
x ~~ 1*x
y ~~ 1*y'

reps <- 10000 #10,000 repetitions
N <- 30       #sample size of 30

pval_reg <- double(reps)
pval_cor <- double(reps)

for (i in 1:reps){

  # population skewness is 0 and population kurtosis is 25 for each variable
  datum <- simulateData(mod1, sample.nobs=N, skewness=c(0,0), kurtosis=c(25,25))
  fitt <- summary(lm(y~x, data=datum))

  pval_reg[i] <- fitt$coefficients[2,4]              # p-value of the regression slope
  pval_cor[i] <- cor.test(datum$x,datum$y)$p.value   # p-value of Pearson's r

}

sum(pval_reg<.05)/reps
[1] 0.062

sum(pval_cor<.05)/reps
[1] 0.062

so an inflated type I error rate as an effect of non-zero kurtosis for both.

let's do it again but with a sample size of 1000

Code:
library(lavaan) # need this to generate the data

# correlation is 0 in the population so any rejection rate over .05 is an inflated Type I error rate
mod1 <- 'x ~~ 0.0*y
x ~~ 1*x
y ~~ 1*y'

reps <- 10000 #10,000 repetitions
N <- 1000     #sample size of 1000

pval_reg <- double(reps)
pval_cor <- double(reps)

for (i in 1:reps){

  # population skewness is 0 and population kurtosis is 25 for each variable
  datum <- simulateData(mod1, sample.nobs=N, skewness=c(0,0), kurtosis=c(25,25))
  fitt <- summary(lm(y~x, data=datum))

  pval_reg[i] <- fitt$coefficients[2,4]              # p-value of the regression slope
  pval_cor[i] <- cor.test(datum$x,datum$y)$p.value   # p-value of Pearson's r

}

sum(pval_reg<.05)/reps
[1] 0.048

sum(pval_cor<.05)/reps
[1] 0.048

which is pretty close to .05. so yeah... we can validate the intuition Karabiner had. for large sample sizes, distributional assumptions become irrelevant for these tests. my guess would be, of course, that if the method that you're using is complex (like SEM instead of a simple bivariate regression/correlation) then the distributional assumptions become important again.
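For anyone who wants to try other sample sizes without copying the whole block, the simulation could be wrapped in a function that takes the sample size as an argument. The sketch below is not the lavaan setup used above: it swaps `simulateData` for independent draws from a Student t distribution with 5 degrees of freedom (excess kurtosis 6, not 25), so the numbers it produces are not comparable to the ones reported in this post. The function name `type1_rate` and its defaults are made up for illustration.

```r
# Empirical Type I error rate of the test of r = 0 when x and y are
# independent and heavy-tailed (Student t with `df` degrees of freedom)
type1_rate <- function(N, reps = 2000, df = 5, alpha = 0.05) {
  pvals <- replicate(reps, {
    x <- rt(N, df = df)
    y <- rt(N, df = df)
    cor.test(x, y)$p.value
  })
  mean(pvals < alpha)  # share of replications that reject at level alpha
}

set.seed(4)
sapply(c(30, 100, 1000), type1_rate)
```

Because the regression and correlation p-values are identical, only `cor.test` is needed here.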