Thread: heteroskedasticity and non normal residuals in linear regression - please help!

  1. #1
    SiBorg

    heteroskedasticity and non normal residuals in linear regression - please help!




    Dear All

    I'm struggling with a linear model.

    My dependent variable is the change in central anterior chamber depth of the eye over time. Measurement 1 is at time 0, measurement 2 is 6 years later, and I am regressing the change on predictor variables.

    My dependent variable is not normally distributed (see histogram).

    I have developed 5 hierarchical models, each adding progressively less significant terms.

    All of the models show heteroskedasticity and non-normal errors.

    I want to report associations but don't want to get things wrong, so any help would be appreciated.

    So, model '210' is the simplest model where the anterior chamber depth change is most correlated with the starting depth (I think this association is too strong to be explained by 'regression to the mean' but opinions would be appreciated).

    Code: 
    > model210<-lm(R_ACD_change~Racd_screen_median,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
    > summary(model210)
    
    Call:
    lm(formula = R_ACD_change ~ Racd_screen_median, data = ACDdata[(ACDdata$Pi_Bl1 == 
        0 & ACDdata$Racd_screen_median < 3), ])
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.39377 -0.12872 -0.02327  0.07721  0.74808 
    
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
    (Intercept)          1.2766     0.1787   7.145 2.13e-11 ***
    Racd_screen_median  -0.5908     0.0737  -8.015 1.31e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.1868 on 181 degrees of freedom
    Multiple R-squared: 0.262,	Adjusted R-squared: 0.2579 
    F-statistic: 64.25 on 1 and 181 DF,  p-value: 1.312e-13
    This model corresponds to crPlots1 and diagnostic plot 1.
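    (For anyone reproducing this, the attached plots were generated roughly along the lines below; this is a sketch, assuming the car package was used for the component + residual plots.)

    Code: 
    library(car)                 # for crPlots()
    crPlots(model210)            # component + residual plots ("crPlots1")
    par(mfrow = c(2, 2))
    plot(model210)               # residuals vs fitted, QQ, scale-location, leverage ("diagnostic plot 1")
    par(mfrow = c(1, 1))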

    The next model adds the angle width as a predictor

    Code: 
    > model220<-lm(R_ACD_change~Racd_screen_median+R_median_shaff,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
    > summary(model220)
    
    Call:
    lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff, 
        data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 
            3), ])
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.37483 -0.12239 -0.02013  0.08611  0.61278 
    
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
    (Intercept)         1.21075    0.16842   7.189 1.71e-11 ***
    Racd_screen_median -0.64229    0.07000  -9.175  < 2e-16 ***
    R_median_shaff      0.09470    0.01852   5.115 8.03e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.1755 on 179 degrees of freedom
      (1 observation deleted due to missingness)
    Multiple R-squared: 0.3557,	Adjusted R-squared: 0.3485 
    F-statistic:  49.4 on 2 and 179 DF,  p-value: < 2.2e-16
    This corresponds to crPlots2 and diagnostic2

    The next model adds the follow-up lens thickness/eye length ratio. This was only available at follow-up so is not a 'predictor' but is associated with the change.


    Code: 
    > model230<-lm(R_ACD_change~Racd_screen_median+R_median_shaff+R_Lens_AL,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
    > summary(model230)
    
    Call:
    lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff + 
        R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 
        3), ])
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.35673 -0.09986 -0.01836  0.09092  0.54430 
    
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
    (Intercept)         2.34071    0.25461   9.193  < 2e-16 ***
    Racd_screen_median -0.66705    0.06598 -10.110  < 2e-16 ***
    R_median_shaff      0.07861    0.01765   4.454 1.56e-05 ***
    R_Lens_AL          -4.87701    0.86730  -5.623 7.91e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.16 on 164 degrees of freedom
      (15 observations deleted due to missingness)
    Multiple R-squared: 0.4631,	Adjusted R-squared: 0.4533 
    F-statistic: 47.16 on 3 and 164 DF,  p-value: < 2.2e-16
    This corresponds to crPlots3 and diagnostic3

    The final model has the change in angle width as well

    Code: 
    > summary(model240)
    
    Call:
    lm(formula = R_ACD_change ~ Racd_screen_median + R_Shaffer_Change + 
        R_median_shaff + R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 
        0 & ACDdata$Racd_screen_median < 3), ])
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.32231 -0.10900 -0.01588  0.08898  0.48960 
    
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
    (Intercept)         2.21582    0.24535   9.031 5.01e-16 ***
    Racd_screen_median -0.68331    0.06315 -10.821  < 2e-16 ***
    R_Shaffer_Change    0.04892    0.01444   3.387 0.000889 ***
    R_median_shaff      0.09389    0.01789   5.247 4.81e-07 ***
    R_Lens_AL          -4.04595    0.84703  -4.777 3.98e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.1519 on 161 degrees of freedom
      (17 observations deleted due to missingness)
    Multiple R-squared: 0.5038,	Adjusted R-squared: 0.4915 
    F-statistic: 40.87 on 4 and 161 DF,  p-value: < 2.2e-16
    This corresponds to crPlots4 and diagnostic4

    I'm mainly interested in reporting the association with the starting anterior chamber depth (which seems strong) and with angle width (which is weak but interesting). I'm also quite interested in reporting the association with follow-up lens thickness, as lens thickness changes are thought to contribute to this.

    Does this seem reasonable or do I need to consider robust regression instead?
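    In case it helps, the kind of robust alternative I have in mind is sketched below (the lmtest, sandwich and MASS calls are only a suggestion; I have not run them on this data):

    Code: 
    library(lmtest)      # coeftest()
    library(sandwich)    # vcovHC()
    library(MASS)        # rlm()
    
    # Keep the OLS estimates but use heteroskedasticity-consistent (HC3) standard errors
    coeftest(model240, vcov = vcovHC(model240, type = "HC3"))
    
    # Or refit by M-estimation, which downweights observations with large residuals
    model240.rob <- rlm(R_ACD_change ~ Racd_screen_median + R_Shaffer_Change +
                          R_median_shaff + R_Lens_AL,
                        data = ACDdata[(ACDdata$Pi_Bl1 == 0 &
                                          ACDdata$Racd_screen_median < 3), ])
    summary(model240.rob)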

    Very very many thanks for looking at this.
    Attached Files

  2. #2
    SiBorg

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    some more attachments....
    Attached Files

  3. #3
    SiBorg

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    Here is the new QQ plot for the full model, having taken the difference of the logs as the dependent variable.
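    For reference, the refit was roughly along these lines (a sketch from memory; it assumes R_Acd_Median is the follow-up depth and Racd_screen_median the baseline, as in the code later in this thread):

    Code: 
    # Log-difference of follow-up over baseline depth as the new dependent variable
    ACDdata$R_ACD_logchange <- log(ACDdata$R_Acd_Median) - log(ACDdata$Racd_screen_median)
    model240.log <- lm(R_ACD_logchange ~ Racd_screen_median + R_Shaffer_Change +
                         R_median_shaff + R_Lens_AL,
                       data = ACDdata[(ACDdata$Pi_Bl1 == 0 &
                                         ACDdata$Racd_screen_median < 3), ])
    qqnorm(resid(model240.log)); qqline(resid(model240.log))   # QQ plot of the new residuals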
    Attached Images

  4. #4
    noetsi

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    You clearly do have heteroskedasticity, as the range of the residuals changes with the level of the fitted values (I assume it changes with the level of X, although I run different diagnostics for this). And your QQ plot suggests you have significant outliers in the tail, or skewness, or both. You might run a skewness test to see, and try to identify the outliers.

    It is possible that if you deal with the outliers the problem will not be significant. But more likely you need to transform your data to make it homoskedastic. How is your dependent variable measured? Is it an interval-scale variable?
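    Something like the following would do it, run on whichever model ends up being reported (model240 here as an example; the e1071 and lmtest packages are my suggestion and are not used elsewhere in the thread):

    Code: 
    library(e1071)     # skewness()
    library(lmtest)    # bptest()
    
    skewness(resid(model240))            # positive values indicate a right-skewed residual tail
    bptest(model240)                     # Breusch-Pagan test for heteroskedasticity
    
    plot(model240, which = 4)            # Cook's distance, to spot influential points
    which(abs(rstudent(model240)) > 3)   # observations with large studentized residuals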
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  5. #5
    Dason

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    Are you looking at logdiagnostics.pdf or one of the earlier plots?

  6. #6
    GretaGarbo

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    @noetsi, where do you see any heteroscedasticity?

  7. #7
    GretaGarbo

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    I am not sure how to think about the independent variables.

    the simplest model where the anterior chamber depth change is most correlated with the starting depth
    So you start by measuring A. Then some time later you measure B.

    Then you take the difference (B-A) and explain it by the original measure A.

    In this regression model:

    (B-A) = a + b*A + error

    (where a and b are estimated parameters.)

    Is it strange that the difference (B-A) is to a large extent explained by A? The first estimated coefficient was close to –0.6. A strongly negative coefficient is what you could expect even from two completely independent random variables (in that case the expected slope is –1).

    I am not saying that the model is wrong. But is it relevant?

    What kind of models do they usually use in the published literature?
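    To spell the point out (a standard regression-to-the-mean calculation, not specific to this data): if the two measurements have standard deviations \sigma_A and \sigma_B and correlation \rho, the population slope from regressing (B-A) on A is

    \frac{\mbox{Cov}(B-A,\,A)}{\mbox{Var}(A)} = \frac{\rho\,\sigma_A\sigma_B - \sigma_A^2}{\sigma_A^2} = \rho\,\frac{\sigma_B}{\sigma_A} - 1

    With roughly equal spreads this is about \rho - 1: it equals -1 when the two measurements are completely independent and only approaches 0 as they become perfectly correlated, so a strongly negative slope on its own says little about any real change process.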

  9. #8
    SiBorg

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    Greta, as always, you make a very interesting point.

    So, I thought I would test it by simulation, and I came up with a very interesting result.

    So, we have my starting ACD depths which have mean 2.421467 and sd 0.2439209

    And my follow-up ACD depths which have mean 2.2628 and sd 0.2278464 (see the code below)

    So, on average, follow-up ACD depths are shallower than baseline ones.

    Now, let's assume that these are representative samples from two normal distributions - (1) the baseline depths and (2) the follow up depths

    If we do this, we can generate a vector of values that are a random sample from distribution 1 and a second vector that is a random sample from distribution 2. We can then perform a linear regression of the change between value 1 and value 2 on value 1 as the predictor. However, note that value 1 tells you nothing about value 2, because they are independent random samples taken from 2 normal distributions; the only difference is that values from the second distribution tend to be smaller than those from the first.

    My code for this is as follows:

    Code: 
    mean(ACDdata$Racd_screen_median)
    [1] 2.421467
    sd(ACDdata$Racd_screen_median)
    [1] 0.2439209
    ACD1<-rnorm(200,mean(ACDdata$Racd_screen_median),sd(ACDdata$Racd_screen_median))
    
    mean(ACDdata$R_Acd_Median)
    [1] 2.2628
    sd(ACDdata$R_Acd_Median)
    [1] 0.2278464
    ACD2<-rnorm(200,mean(ACDdata$R_Acd_Median),sd(ACDdata$R_Acd_Median))
    
    t.test(ACDdata$Racd_screen_median,ACDdata$R_Acd_Median)
    
    	Welch Two Sample t-test
    
    data:  ACDdata$Racd_screen_median and ACDdata$R_Acd_Median 
    t = 7.1304, df = 445.934, p-value = 4.064e-12
    alternative hypothesis: true difference in means is not equal to 0 
    95 percent confidence interval:
     0.1149344 0.2023989 
    sample estimates:
    mean of x mean of y 
     2.421467  2.262800 
    
    t.test(ACD1,ACD2)
    
    	Welch Two Sample t-test
    
    data:  ACD1 and ACD2 
    t = 5.08, df = 394.406, p-value = 5.835e-07
    alternative hypothesis: true difference in means is not equal to 0 
    95 percent confidence interval:
     0.07312061 0.16544830 
    sample estimates:
    mean of x mean of y 
     2.393669  2.274385
    
    ACDtest<-data.frame(ACD1,ACD2)
    ACDtest$ACDdiff<-ACDtest$ACD2-ACDtest$ACD1
    model.test<-lm(ACDdiff~ACD1,data=ACDtest)
    And, lo and behold, when we look at this model, starting depth significantly predicts shallowing:

    Code: 
    summary(model.test)
    
    Call:
    lm(formula = ACDdiff ~ ACD1, data = ACDtest)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.79016 -0.16329  0.01346  0.17372  0.55780 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  2.36231    0.15526   15.21   <2e-16 ***
    ACD1        -1.03673    0.06452  -16.07   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.2237 on 198 degrees of freedom
    Multiple R-squared: 0.5659,	Adjusted R-squared: 0.5637 
    F-statistic: 258.2 on 1 and 198 DF,  p-value: < 2.2e-16
    And when we plot this it looks quite convincing (see crPlot (test plot) and diagnostic plot), even though there is no relationship between the values of ACD1 and ACD2 other than that they were sampled from normal distributions where one happened to have a smaller mean than the other.

    Therefore, you cannot regress on something that you have used to create the dependent variable.

    Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper, because 2 other similar models with the same flaw have been published recently...

    Are the conclusions from my simulation correct? If so, I may go as far as to say that Greta is a Genius.
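    For anyone who wants to reproduce this without the original data, here is a self-contained version of the same simulation (the means and SDs are just the rounded values quoted above, and set.seed is added so the result is repeatable):

    Code: 
    set.seed(42)                                   # not in the original run; added for reproducibility
    ACD1 <- rnorm(200, mean = 2.42, sd = 0.244)    # "baseline" depths
    ACD2 <- rnorm(200, mean = 2.26, sd = 0.228)    # independent "follow-up" depths
    ACDtest <- data.frame(ACD1, ACD2, ACDdiff = ACD2 - ACD1)
    model.test <- lm(ACDdiff ~ ACD1, data = ACDtest)
    summary(model.test)$coefficients               # the slope on ACD1 comes out close to -1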
    Attached Files

  10. #9
    noetsi

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    Quote Originally Posted by GretaGarbo View Post
    @noetsi, Where do you se any heteroscedasticity?
    To me, in the residuals versus fitted values graph, it looks like the spreads are not similar across the levels of the X value, which is how heteroscedasticity is captured. Admittedly the difference in the spread is not great.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  11. #10
    Dason

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    Quote Originally Posted by noetsi View Post
    To me, in the residuals versus fitted values graph, it looks like the spreads are not similar across the levels of the X value, which is how heteroscedasticity is captured. Admittedly the difference in the spread is not great.
    Yes but which plot? He fit quite a few models.

  12. #11
    noetsi

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    I was referring to the logdiagnostics.pdf from the third post. It was the only one I saw.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  13. #12
    Dason

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    You see heteroskedasticity in that plot? I don't. I mean, the left side of the plot doesn't seem to have as large a range of values, but it also has a lot fewer values, which can influence our perception of the spread of the data. It looks alright to me. And even if heteroskedasticity were present, it's not strong enough to care too much about, in my opinion.

  14. #13
    noetsi

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    As I said, it is close. To me it seemed there was enough difference to comment on. But there is no solid rule that I am aware of (when the differences are not extreme) for how far the range has to change before you have serious heteroskedasticity. It comes down to eyeballing the data and making a judgement call.

    That is another way of saying you could well be right. The first time I looked at it, the differences seemed further apart.
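    If it helps take the eyeballing out of it, a formal score test for non-constant variance could be run on the model in question, e.g. with car::ncvTest() (a suggestion on my part; nobody in the thread has posted this, and model240 is just an example):

    Code: 
    library(car)
    ncvTest(model240)   # score test of constant error variance against the fitted values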
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  15. #14
    Dason

    Re: heteroskedasticity and non normal residuals in linear regression - please help!

    Quote Originally Posted by SiBorg View Post
    Therefore, you cannot regress on something that you have used to create the dependent variable.
    Well you can - you just shouldn't be surprised if it comes out significant but ... well we know it's related because we used it to create the variable.
    Quote Originally Posted by SiBorg View Post
    Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper, because 2 other similar models with the same flaw have been published recently...
    Well it's easy enough to show if we assume X and Y are independent and normally distributed:

    X \sim N(\mu_x, \sigma_x^2), \qquad Y \sim N(\mu_y, \sigma_y^2)

    so

    Y - X \sim N(\mu_y - \mu_x,\ \sigma_x^2 + \sigma_y^2)

    and, conditioning on X,

    (Y - X \mid X) \sim N(\mu_y - X,\ \sigma_y^2)

    So if we let Diff = Y - X and fit the model Diff ~ X, the regression should come out pretty close to \widehat{\mbox{Diff}}_i = \mu_y - 1 \cdot X_i

  17. #15
    SiBorg

    Re: heteroskedasticity and non normal residuals in linear regression - please help!


    Quote Originally Posted by Dason View Post
    well we know it's related because we used it to create the variable.
    It's related, but, in my simulation, the key is that you can't use it to predict future shallowing, can you? There was no relationship between the two measurements in my simulation; they were totally random.
