
Thread: Pearson's versus Spearman's correlation

  1. #1 (original poster)

    Pearson's versus Spearman's correlation




    I am trying to determine which correlation coefficient is more appropriate.

    The two variables to be correlated are visual acuity at baseline (which can take both negative and positive values) and visual acuity after one year.


    Pearson's correlation is appropriate if:

    1. The data are interval or ratio data

    2. There is a linear relationship between the two variables

    3. There are no significant outliers

    4. Both variables are normally distributed


    I know that assumptions 1 and 3 are not violated.

    However, to assess linearity I made a scatter plot, which looks unusual (attached).

    I know that the data for both variables are normally distributed, but I also came across the term "bivariate normal distribution". Can someone explain this term in a non-technical way?

    Does the shift in visual acuity over time mean that the fourth assumption is violated? This is somewhat confusing to me.

    I am beginning to think Spearman's correlation is more appropriate but I would like to fully justify this decision.
    Attached Images: [scatter plot of baseline visual acuity against visual acuity after one year]
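
    For reference, both coefficients are a single call in R; a minimal sketch, assuming the two measures live in a hypothetical data frame va with columns baseline and year1:

    Code:
    # hypothetical stand-in data; replace with the real acuity measures
    set.seed(1)
    va <- data.frame(baseline = rnorm(100), year1 = rnorm(100))

    cor.test(va$baseline, va$year1, method = "pearson")   # Pearson's r
    cor.test(va$baseline, va$year1, method = "spearman")  # Spearman's rho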

  2. #2 CowboyBear (TS Contributor)

    Re: Pearson's versus Spearman's correlation

    Bivariate normal distribution means:

    1. The (marginal) distribution of the first variable is normal
    2. The (marginal) distribution of the second variable is normal
    3. Given any value of one variable, the (conditional) distribution of the other variable is normal

    You can't really tell much about normality from a scatter plot. Try a Q-Q plot of each variable, and if you want to be really careful, a Q-Q plot of the residuals of Y when X is a predictor (and vice versa).
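
    A minimal R sketch of those checks, reusing the hypothetical va data frame with columns baseline and year1:

    Code:
    # hypothetical stand-in data; replace with the real acuity measures
    set.seed(1)
    va <- data.frame(baseline = rnorm(100), year1 = rnorm(100))

    # Q-Q plot of each variable separately (marginal normality)
    qqnorm(va$baseline); qqline(va$baseline)
    qqnorm(va$year1);    qqline(va$year1)

    # Q-Q plot of the residuals of Y when X is a predictor (conditional
    # normality); swap the two variables for the "vice versa" check
    fit <- lm(year1 ~ baseline, data = va)
    qqnorm(resid(fit)); qqline(resid(fit))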

  3. #3 Karabiner (TS Contributor)

    Re: Pearson's versus Spearman's correlation

    What has always bothered me: Pearson's r is exactly the same as beta in a simple regression, and the p-value for testing r or beta, respectively, is exactly the same; but in the case of correlation, bivariate normality is assumed, while in regression only the residuals need to be normally distributed (in the population), and even this can be more or less neglected if the sample size is large. Would this mean that one can simply neglect the normality assumptions when testing Pearson's r, if the sample size is large?

    With kind regards

    K.
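
    A minimal R check of the equivalence Karabiner describes (illustrative data, not from the thread; any x and y will do):

    Code:
    # simulate some arbitrary bivariate data
    set.seed(42)
    x <- rnorm(50)
    y <- 0.3 * x + rnorm(50)

    # p-value for the slope in a simple regression of y on x
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]

    # p-value from the test of Pearson's r: identical, because both
    # tests use the same t statistic with n - 2 degrees of freedom
    cor.test(x, y)$p.value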

  4. #4 CowboyBear (TS Contributor)

    Re: Pearson's versus Spearman's correlation

    Yeah, I've never quite got my head around that one either; it doesn't make sense. Anybody got any ideas?

  5. #5 (original poster)

    Re: Pearson's versus Spearman's correlation

    Thanks for your help guys.

    It's kind of strange because my Q-Q plots look unusual too, unless I'm doing it wrong.

    The sample I have contains about 100,000 cases.

    In this case, should the normality condition be ignored? Is it justified to do so?

  6. #6 Karabiner (TS Contributor)

    Re: Pearson's versus Spearman's correlation

    At least the standard error of the estimation is so extremely tiny with 100,000 cases that it sounds strange that the shape of the distributions could still affect the inference from the sample data to the population (i.e. the significance statement). And if this were linear regression, one would be absolutely sure that the results of the significance test are reliable.
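
    For a sense of scale: the usual t test of r uses t = r * sqrt((n - 2)/(1 - r^2)), so under the null the standard error of r is roughly 1/sqrt(n - 2). A quick computation:

    Code:
    n <- 100000
    1 / sqrt(n - 2)   # approximate SE of r under the null: about 0.0032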

  7. #7 spunky (TS Contributor)

    Re: Pearson's versus Spearman's correlation


    Quote Originally Posted by Karabiner
        This would mean, simply neglect the normality assumptions for the testing of Pearson's r, if sample size is large?

    Quote Originally Posted by CowboyBear
        Yeah, I've never quite got my head around that one either; it doesn't make sense. Anybody got any ideas?
    well... it seems pretty reasonable to me that the behaviour of the test statistic SHOULD be the same for correlation and for simple bivariate regression... but the important thing here is that we don't need to conjecture anything about it. we can simulate it.

    first, i'd like to direct people's attention HERE for why i am simulating data with non-zero kurtosis specifically. kurtosis is intrinsically related to the variability of anything variance-related... so if you're after the standard errors of covariances, variances or correlations, kurtosis is going to show up one way or another.

    the first half of the article is a very good and useful explanation of how kurtosis distorts the standard errors of variances/covariances/correlations. the second half deals with more Structural Equation Modelling (SEM) stuff, so i wouldn't recommend it as much unless you're interested.
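
    as a quick side illustration of that point (my own sketch, not from the article): to first order, the sampling variance of a sample variance is (mu4 - sigma^4)/n, where mu4 is the fourth central moment, so heavier tails mean noisier variance estimates. here a rescaled t distribution with 5 df (excess kurtosis 6) stands in for a heavy-tailed population:

    Code:
    set.seed(123)
    reps <- 5000
    n <- 100

    # sample variances from a normal population (excess kurtosis 0)
    v_norm <- replicate(reps, var(rnorm(n)))

    # sample variances from a t(5) population rescaled to unit variance
    # (excess kurtosis 6); same population variance, heavier tails
    v_heavy <- replicate(reps, var(rt(n, df = 5) / sqrt(5/3)))

    sd(v_norm)   # roughly sqrt(2/(n - 1)), about 0.14
    sd(v_heavy)  # noticeably larger, driven purely by the kurtosis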

    anyhoo... the simulation.

    first, i ran a "baseline" simulation where the data were bivariate normal and got what you would expect: the empirical rejection rates were very, very close to .05 (and equal in both cases) because it's essentially the same test.

    let's see what happens when you have a smallish sample size of 30:

    Code:
    library(lavaan)  # need this to generate the data

    # correlation is 0 in the population, so any rejection rate over .05
    # is an inflated Type I error rate
    mod1 <- 'x ~~ 0.0*y
             x ~~ 1*x
             y ~~ 1*y'

    reps <- 10000  # 10,000 repetitions
    N <- 30        # sample size of 30

    pval_reg <- double(reps)
    pval_cor <- double(reps)

    for (i in 1:reps){
      # population skewness is 0 and population kurtosis is 25 for each variable
      datum <- simulateData(mod1, sample.nobs = N, skewness = c(0, 0), kurtosis = c(25, 25))
      fitt <- summary(lm(y ~ x, data = datum))

      pval_reg[i] <- fitt$coefficients[2, 4]
      pval_cor[i] <- cor.test(datum$x, datum$y)$p.value
    }

    sum(pval_reg < .05)/reps
    ## [1] 0.062

    sum(pval_cor < .05)/reps
    ## [1] 0.062
    so an inflated type I error rate as an effect of non-zero kurtosis for both.

    let's do it again, but with a sample size of 1,000:

    Code:
    library(lavaan)  # need this to generate the data

    # correlation is 0 in the population, so any rejection rate over .05
    # is an inflated Type I error rate
    mod1 <- 'x ~~ 0.0*y
             x ~~ 1*x
             y ~~ 1*y'

    reps <- 10000  # 10,000 repetitions
    N <- 1000      # sample size of 1,000 this time

    pval_reg <- double(reps)
    pval_cor <- double(reps)

    for (i in 1:reps){
      # population skewness is 0 and population kurtosis is 25 for each variable
      datum <- simulateData(mod1, sample.nobs = N, skewness = c(0, 0), kurtosis = c(25, 25))
      fitt <- summary(lm(y ~ x, data = datum))

      pval_reg[i] <- fitt$coefficients[2, 4]
      pval_cor[i] <- cor.test(datum$x, datum$y)$p.value
    }

    sum(pval_reg < .05)/reps
    ## [1] 0.048

    sum(pval_cor < .05)/reps
    ## [1] 0.048
    which is pretty close to .05. so yeah... we can validate the intuition Karabiner had: for large sample sizes, the distributional assumptions become irrelevant for these tests. my guess would be, of course, that if the method you're using is complex (like SEM instead of a simple bivariate regression/correlation), then the distributional assumptions become important again.
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/
