Heteroskedasticity and non-normal residuals in linear regression - please help!

SiBorg

New Member
#1
Dear All

I'm struggling with a linear model.

My dependent variable is the change in central anterior chamber depth of the eye over time. Measurement 1 is at time 0, measurement 2 is 6 years later, and I am regressing the change on predictor variables.

My dependent variable is not normally distributed (see histogram).

I have developed 5 hierarchical models, each adding progressively less significant terms.

All of the models show heteroskedasticity and non-normal errors.

I want to report associations but don't want to get things wrong, so any help would be appreciated.

So, model '210' is the simplest model where the anterior chamber depth change is most correlated with the starting depth (I think this association is too strong to be explained by 'regression to the mean', but opinions would be appreciated).

Code:
> model210<-lm(R_ACD_change~Racd_screen_median,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model210)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median, data = ACDdata[(ACDdata$Pi_Bl1 == 
    0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.39377 -0.12872 -0.02327  0.07721  0.74808 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.2766     0.1787   7.145 2.13e-11 ***
Racd_screen_median  -0.5908     0.0737  -8.015 1.31e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1868 on 181 degrees of freedom
Multiple R-squared: 0.262,	Adjusted R-squared: 0.2579 
F-statistic: 64.25 on 1 and 181 DF,  p-value: 1.312e-13
This model corresponds to crPlots1 and diagnostic plot 1.

The next model adds the angle width as a predictor.

Code:
> model220<-lm(R_ACD_change~Racd_screen_median+R_median_shaff,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model220)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff, 
    data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 
        3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.37483 -0.12239 -0.02013  0.08611  0.61278 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.21075    0.16842   7.189 1.71e-11 ***
Racd_screen_median -0.64229    0.07000  -9.175  < 2e-16 ***
R_median_shaff      0.09470    0.01852   5.115 8.03e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1755 on 179 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.3557,	Adjusted R-squared: 0.3485 
F-statistic:  49.4 on 2 and 179 DF,  p-value: < 2.2e-16
This corresponds to crPlots2 and diagnostic2.

The next model adds the follow-up lens thickness/eye length ratio. This was only available at follow-up, so it is not strictly a 'predictor', but it is associated with the change.


Code:
> model230<-lm(R_ACD_change~Racd_screen_median+R_median_shaff+R_Lens_AL,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model230)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff + 
    R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 
    3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35673 -0.09986 -0.01836  0.09092  0.54430 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.34071    0.25461   9.193  < 2e-16 ***
Racd_screen_median -0.66705    0.06598 -10.110  < 2e-16 ***
R_median_shaff      0.07861    0.01765   4.454 1.56e-05 ***
R_Lens_AL          -4.87701    0.86730  -5.623 7.91e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.16 on 164 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.4631,	Adjusted R-squared: 0.4533 
F-statistic: 47.16 on 3 and 164 DF,  p-value: < 2.2e-16
This corresponds to crPlots3 and diagnostic3.

The final model adds the change in angle width as well.

Code:
> summary(model240)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_Shaffer_Change + 
    R_median_shaff + R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 
    0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32231 -0.10900 -0.01588  0.08898  0.48960 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.21582    0.24535   9.031 5.01e-16 ***
Racd_screen_median -0.68331    0.06315 -10.821  < 2e-16 ***
R_Shaffer_Change    0.04892    0.01444   3.387 0.000889 ***
R_median_shaff      0.09389    0.01789   5.247 4.81e-07 ***
R_Lens_AL          -4.04595    0.84703  -4.777 3.98e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1519 on 161 degrees of freedom
  (17 observations deleted due to missingness)
Multiple R-squared: 0.5038,	Adjusted R-squared: 0.4915 
F-statistic: 40.87 on 4 and 161 DF,  p-value: < 2.2e-16
This corresponds to crPlots4 and diagnostic4.

I'm mainly interested in reporting the association with starting angle depth (which seems strong) and with angle width (which is weak but interesting). I'm also quite interested in reporting the association with follow-up lens thickness as lens thickness changes are thought to contribute to this.

Does this seem reasonable, or do I need to consider robust regression instead?
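(By robust methods I mean something like the sketch below; this assumes the MASS, sandwich and lmtest packages, and is just to show the kind of thing I'd consider.)

Code:
# Two options if the diagnostics turn out to matter:
library(MASS)      # rlm(): M-estimation, downweights outlying residuals
library(sandwich)  # vcovHC(): heteroskedasticity-consistent covariance
library(lmtest)    # coeftest(): coefficient tests with a supplied covariance

# Option 1: keep the OLS estimates but use robust (HC3) standard errors
coeftest(model210, vcov = vcovHC(model210, type = "HC3"))

# Option 2: robust regression proper
rmodel210 <- rlm(R_ACD_change ~ Racd_screen_median,
                 data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])
summary(rmodel210)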

Very very many thanks for looking at this.
 

noetsi

Fortran must die
#4
You clearly do have heteroskedasticity, as the range of the residuals changes with the level of the fitted values (I assume it changes with the level of X, although I run different diagnostics for this). And your Q-Q plot suggests you have significant outliers in the tail, or skewness, or both. You might run a skewness test to see, and try to identify the outliers.

It is possible that if you deal with the outliers the problem will no longer be significant. But more likely you need to transform your data to make it homoskedastic. How is your dependent variable measured? Is it an interval-scale variable?
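For example, something like this (a sketch, assuming model210 from above and the CRAN packages lmtest and moments):

Code:
# Formal checks to back up the diagnostic plots
library(lmtest)    # provides bptest()
library(moments)   # provides skewness()

bptest(model210)                   # Breusch-Pagan test for heteroskedasticity
skewness(residuals(model210))      # sample skewness of the residuals
shapiro.test(residuals(model210))  # Shapiro-Wilk test of residual normality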
 
#7
I am not sure how to think about the independent variables.

the simplest model where the anterior chamber depth change is most correlated with the starting depth
So you start by measuring A. Then some time later you measure B, and you take the difference (B-A).

Then you explain that difference by the original measure A.

In this regression model:

(B-A) = a + b*A + error

(where a and b are estimated parameters.)

Is it strange that the difference (B-A) is to a large extent explained by A? The first estimated coefficient was close to –0.5. That’s what you could expect from two completely independent random variables.

I am not saying that the model is wrong. But is it relevant?

What kind of models do they usually use in the published literature?
 

SiBorg

New Member
#8
Greta, as always, you make a very interesting point.

So, I thought I would test it by simulation, and I came up with a very interesting result.

So, we have my starting ACD depths, which have mean 2.421467 and sd 0.2439209,

and my follow-up ACD depths, which have mean 2.2628 and sd 0.2278464.

So, on average, follow-up ACD depths are shallower than baseline ones.

Now, let's assume that these are representative samples from two normal distributions: (1) the baseline depths and (2) the follow-up depths.

If we do this, we can generate a vector of values that are a random sample from distribution 1 and a second vector of values that are a random sample from distribution 2. We can then regress the change from value 1 to value 2 on value 1 as the predictor variable. However, note that value 1 does not tell you anything about value 2, because they are independent random samples taken from two normal distributions, with the only difference being that values from the second distribution tend to be smaller than those from the first.

My code for this is as follows:

Code:
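# Note: a set.seed() call before the rnorm() draws below would make this simulation reproducible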
mean(ACDdata$Racd_screen_median)
[1] 2.421467
sd(ACDdata$Racd_screen_median)
[1] 0.2439209
ACD1<-rnorm(200,mean(ACDdata$Racd_screen_median),sd(ACDdata$Racd_screen_median))

mean(ACDdata$R_Acd_Median)
[1] 2.2628
sd(ACDdata$R_Acd_Median)
[1] 0.2278464
ACD2<-rnorm(200,mean(ACDdata$R_Acd_Median),sd(ACDdata$R_Acd_Median))

t.test(ACDdata$Racd_screen_median,ACDdata$R_Acd_Median)

	Welch Two Sample t-test

data:  ACDdata$Racd_screen_median and ACDdata$R_Acd_Median 
t = 7.1304, df = 445.934, p-value = 4.064e-12
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.1149344 0.2023989 
sample estimates:
mean of x mean of y 
 2.421467  2.262800 

t.test(ACD1,ACD2)

	Welch Two Sample t-test

data:  ACD1 and ACD2 
t = 5.08, df = 394.406, p-value = 5.835e-07
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.07312061 0.16544830 
sample estimates:
mean of x mean of y 
 2.393669  2.274385

ACDtest<-data.frame(ACD1,ACD2)
ACDtest$ACDdiff<-ACDtest$ACD2-ACDtest$ACD1
model.test<-lm(ACDdiff~ACD1,data=ACDtest)
And, lo and behold, when we look at this model, starting depth significantly predicts shallowing:

Code:
summary(model.test)

Call:
lm(formula = ACDdiff ~ ACD1, data = ACDtest)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79016 -0.16329  0.01346  0.17372  0.55780 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.36231    0.15526   15.21   <2e-16 ***
ACD1        -1.03673    0.06452  -16.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2237 on 198 degrees of freedom
Multiple R-squared: 0.5659,	Adjusted R-squared: 0.5637 
F-statistic: 258.2 on 1 and 198 DF,  p-value: < 2.2e-16
And when we plot this it looks quite convincing (see crPlot (test plot) and diagnostic plot), even though there is no relationship between the values in ACD1 and ACD2 other than that they were sampled from two normal distributions, one of which happened to have a smaller mean than the other.

Therefore, you cannot regress on something that you have used to create the dependent variable.

Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper, because two other similar models with the same flaw have been published recently...

Are the conclusions from my simulation correct? If so, I may go as far as to say that Greta is a Genius.
 

noetsi

Fortran must die
#9
@noetsi, Where do you see any heteroscedasticity?
To me in the residuals versus fitted values graph it looks like the spreads are not similar across the levels of the X value, which is how heteroscedasticity shows up. Admittedly the difference in the spread is not great.
 

Dason

Ambassador to the humans
#10
To me in the residuals versus fitted values graph it looks like the spreads are not similar across the levels of the X value, which is how heteroscedasticity shows up. Admittedly the difference in the spread is not great.
Yes but which plot? He fit quite a few models.
 

Dason

Ambassador to the humans
#12
You see heteroskedasticity in that plot? I don't. I mean the left side of the plot doesn't seem to have as large a range of values, but it also has a lot fewer values, which can influence our perception of the spread of the data. It looks alright to me. And even if it was present it's not strong enough to care too much about, in my opinion.
 

noetsi

Fortran must die
#13
As I said it is close. To me it seemed there was enough difference to comment on. But there is no solid rule that I am aware of (when the differences are not extreme) for how far the range has to change before there is serious heteroskedasticity. It comes down to eyeing the data and making a judgement call.

Another way of saying you could well be right. The first time I looked at it, the differences seemed further apart :p
 

Dason

Ambassador to the humans
#14
Therefore, you cannot regress on something that you have used to create the dependent variable.
Well you can - you just shouldn't be surprised if it comes out significant but ... well we know it's related because we used it to create the variable.
Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper because there have recently been 2 other similar models with the same flaw published in recent times...
Well it's easy enough to show if we assume X and Y are independent and normally distributed:

\(X \sim N(\mu_x, \sigma_x^2)\)
\(Y \sim N(\mu_y, \sigma_y^2) \)
so
\(Y-X \sim N(\mu_y - \mu_x, \sigma_y^2 + \sigma_x^2)\)
and
\((Y-X | X) \sim N(\mu_y - X, \sigma_y^2)\)

So if we let Diff = Y-X and fit the model Diff~X, the regression should come out pretty close to \(\widehat{\mbox{Diff}}_i = \mu_y - 1 \cdot X_i\).
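A quick numerical check (just a sketch; the means and sds are arbitrary and set.seed only makes it reproducible):

Code:
# Verify the algebra: with independent X and Y, regressing (Y - X) on X
# should give an intercept near mu_y and a slope near -1.
set.seed(1)
x <- rnorm(1e5, mean = 2.42, sd = 0.24)
y <- rnorm(1e5, mean = 2.26, sd = 0.23)
coef(lm(I(y - x) ~ x))  # intercept ~ 2.26, slope ~ -1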
 
#15
well we know it's related because we used it to create the variable.
It's related, but, in my simulation, the key point is that you can't use it to predict future shallowing, can you? Because there was no relationship between the two measurements in my simulation; they were totally random.
 
#16
In other words, is this 'wrong'? Or does it help to know that the starting conditions predict change, even though they don't really predict anything and it just arises out of the maths?
 

noetsi

Fortran must die
#17
Given your Q-Q plot I strongly suggest you do a skewness test and DFBETA. The data appear to be non-normal with a lot of outliers in the tail. It's best to check.
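For example (a sketch, assuming model240 from above and the moments package for the skewness test):

Code:
library(moments)
skewness(residuals(model240))           # skewness of the residuals
dfb <- dfbetas(model240)                # scaled DFBETA, one row per observation
head(dfb[order(-abs(dfb[, "Racd_screen_median"])), ])  # largest influences on the key slope
summary(influence.measures(model240))   # flags potentially influential observations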
 
#18
If so, I may go as far as to say that Greta is a Genius.
Now, I am really embarrassed! Very embarrassed!

Ok it is nice that we say friendly words to each other. Thank you!

Sometimes we, I mean I myself, make stupid comments. Then maybe it is good if we are not too frank.

If we regress (random number – A) on A, of course the regression coefficient will be around –1.

I mean, that’s what the left-hand side of the equation says: there is a –1 in front of A there, as Dason later pointed out. (So I don’t know where I dreamt up the –0.5 coefficient. Stupid guess of mine!)

Anyway, this is an important model that is used again and again. It is very common to take the difference from the baseline and use the baseline as an explanatory factor. I say again: it is not wrong, but is it relevant?

Would it be better to just use the late period’s value and the baseline as explanatory variable? (I don’t think so, but I don’t know why.)
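One way to see the relationship between the two parametrizations (a small illustration, reusing the ACDtest data frame from the simulation above):

Code:
# Change-score model vs. follow-up-on-baseline model: the residuals are
# identical and the baseline slopes differ by exactly 1.
m_change   <- lm(ACDdiff ~ ACD1, data = ACDtest)       # (B - A) ~ A
m_followup <- lm(ACD2 ~ ACD1, data = ACDtest)          # B ~ A
coef(m_followup)["ACD1"] - coef(m_change)["ACD1"]      # equals 1
all.equal(residuals(m_change), residuals(m_followup))  # TRUE
So nothing really changes except which null hypothesis the baseline t-test refers to: slope 0 in the change model corresponds to slope 1 in the follow-up model.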

It is very natural to take the difference so that “the individual acts as its own control”.

There is a simple solution to this and at the moment I can’t see it.

Explain to us!

(A note: It was nice of Noetsi to explain where he saw the heteroscedasticity. I can’t see that. Anyway thanks!)
 
#19
Is this simply an example of 'regression to the mean'? It's just regression to a lower mean, which is why the effects look significant. What we really need is a way of looking at the slopes of the lines between each baseline and follow-up depth, corrected for the difference in means, to see whether there is any 'true' effect.
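One conventional approach, I believe, is Oldham's method: regress the change on the average of the two measurements, whose expected slope is zero when the two variances are equal and there is no true association. A sketch, reusing the simulated ACDtest data frame:

Code:
# Oldham's method: change vs. the average of the two measurements.
# Under equal variances and no true association the expected slope is ~0,
# unlike the change-on-baseline regression, whose slope is pulled towards -1.
ACDtest$ACDavg <- (ACDtest$ACD1 + ACDtest$ACD2) / 2
summary(lm(ACDdiff ~ ACDavg, data = ACDtest))  # slope should be near 0 here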
 
#20
As I said it is close. To me it seemed there was enough difference to comment on. But there is no solid rule that I am aware of (when the differences are not extreme) for how far the range has to change before there is serious heteroskedasticity. It comes down to eyeing the data and making a judgement call.

Another way of saying you could well be right. The first time I looked at it, the differences seemed further apart :p
A note: It was nice of Noetsi to explain where he saw the heteroscedasticity. I can’t see that. Anyway thanks!
You see heteroskedasticity in that plot? I don't. I mean the left side of the plot doesn't seem to have as large a range of values, but it also has a lot fewer values, which can influence our perception of the spread of the data. It looks alright to me. And even if it was present it's not strong enough to care too much about, in my opinion.
Greta, this is why I said recently that interpreting the Q-Q plot is subjective. We can see that the result differs from person to person, or from time to time within one person.

Besides, I too agree on that G thing! GGG = GretaGarboGenius :D :)

Or GGGG = GretaGreatGarboGenius :D