Dear All

I'm struggling with a linear model.

My dependent variable is the change in central anterior chamber depth of the eye over time: measurement 1 is at time 0, measurement 2 is 6 years later, and I am regressing the change on predictor variables.

My dependent variable is not normally distributed (see histogram).

I have developed 5 hierarchical models, each adding progressively less significant terms.

All of the models show heteroskedasticity and non-normal errors.

I want to report associations but don't want to get things wrong, so any help would be appreciated.

So, model '210' is the simplest model, in which the anterior chamber depth change is most strongly correlated with the starting depth (I think this association is too strong to be explained by 'regression to the mean', but opinions would be appreciated).

Code:
> model210<-lm(R_ACD_change~Racd_screen_median,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model210)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
Min       1Q   Median       3Q      Max
-0.39377 -0.12872 -0.02327  0.07721  0.74808

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)          1.2766     0.1787   7.145 2.13e-11 ***
Racd_screen_median  -0.5908     0.0737  -8.015 1.31e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1868 on 181 degrees of freedom
Multiple R-squared: 0.262,	Adjusted R-squared: 0.2579
F-statistic: 64.25 on 1 and 181 DF,  p-value: 1.312e-13
This model corresponds to crPlots1 and diagnostic plot 1.

The next model adds the angle width as a predictor

Code:
> model220<-lm(R_ACD_change~Racd_screen_median+R_median_shaff,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model220)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff,
data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median <
3), ])

Residuals:
Min       1Q   Median       3Q      Max
-0.37483 -0.12239 -0.02013  0.08611  0.61278

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.21075    0.16842   7.189 1.71e-11 ***
Racd_screen_median -0.64229    0.07000  -9.175  < 2e-16 ***
R_median_shaff      0.09470    0.01852   5.115 8.03e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1755 on 179 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.3557,	Adjusted R-squared: 0.3485
F-statistic:  49.4 on 2 and 179 DF,  p-value: < 2.2e-16
This corresponds to crPlots2 and diagnostic2

The next model adds the follow-up lens thickness/eye length ratio. This was only available at follow-up so is not a 'predictor' but is associated with the change.

Code:
> model230<-lm(R_ACD_change~Racd_screen_median+R_median_shaff+R_Lens_AL,ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model230)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff +
R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median <
3), ])

Residuals:
Min       1Q   Median       3Q      Max
-0.35673 -0.09986 -0.01836  0.09092  0.54430

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.34071    0.25461   9.193  < 2e-16 ***
Racd_screen_median -0.66705    0.06598 -10.110  < 2e-16 ***
R_median_shaff      0.07861    0.01765   4.454 1.56e-05 ***
R_Lens_AL          -4.87701    0.86730  -5.623 7.91e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.16 on 164 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.4631,	Adjusted R-squared: 0.4533
F-statistic: 47.16 on 3 and 164 DF,  p-value: < 2.2e-16
This corresponds to crPlots3 and diagnostic3

The final model has the change in angle width as well

Code:
> summary(model240)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_Shaffer_Change +
R_median_shaff + R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
Min       1Q   Median       3Q      Max
-0.32231 -0.10900 -0.01588  0.08898  0.48960

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.21582    0.24535   9.031 5.01e-16 ***
Racd_screen_median -0.68331    0.06315 -10.821  < 2e-16 ***
R_Shaffer_Change    0.04892    0.01444   3.387 0.000889 ***
R_median_shaff      0.09389    0.01789   5.247 4.81e-07 ***
R_Lens_AL          -4.04595    0.84703  -4.777 3.98e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1519 on 161 degrees of freedom
(17 observations deleted due to missingness)
Multiple R-squared: 0.5038,	Adjusted R-squared: 0.4915
F-statistic: 40.87 on 4 and 161 DF,  p-value: < 2.2e-16
This corresponds to crPlots4 and diagnostic4

I'm mainly interested in reporting the association with starting angle depth (which seems strong) and with angle width (which is weak but interesting). I'm also quite interested in reporting the association with follow-up lens thickness as lens thickness changes are thought to contribute to this.

Does this seem reasonable, or do I need to consider robust regression instead?
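In case it matters, this is the sort of robust fit I have in mind: a sketch using MASS::rlm() (Huber M-estimation, which downweights outliers; MASS ships with R). The data frame below is simulated stand-in data with a slope chosen to resemble model210's estimate, not my real ACDdata, so the snippet runs on its own.

```r
# Robust-regression sketch with MASS::rlm() on simulated stand-in data.
# Only the rlm() call itself matters; the data are not the real ACDdata.
library(MASS)

set.seed(1)
d <- data.frame(start_depth = runif(150, 2, 3))
d$depth_change <- 1.28 - 0.59 * d$start_depth + rnorm(150, sd = 0.19)

fit_ols <- lm(depth_change ~ start_depth, data = d)
fit_rob <- rlm(depth_change ~ start_depth, data = d)  # Huber M-estimation

# With well-behaved errors the two fits should be close; with heavy-tailed
# residuals rlm() is much less sensitive to the outlying points.
print(coef(fit_ols))
print(coef(fit_rob))
```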

Very very many thanks for looking at this.

Some more attachments....

Here is the new QQ plot for the full model, having taken the difference of the logs as the dependent variable.

You clearly do have heteroskedasticity, as the range of the residuals changes with the level of the fitted values (I assume it also changes with the level of X, although I would run different diagnostics for this). And your QQ plot suggests you have significant outliers in the tail, or skewness, or both. You might run a skewness test to see, and try to identify the outliers.

It is possible that if you deal with the outliers the problem may not be significant. But more likely you need to transform your data to make it homoskedastic. How is your dependent variable measured? Is it an interval-scale variable?
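Something like this sketch, say. It runs on simulated data with error variance growing with x (a stand-in, since I don't have your ACDdata), and the Breusch-Pagan statistic is computed by hand so it needs no extra packages; substitute your own model (e.g. model240) for `fit`.

```r
# Sketch of the suggested diagnostics on simulated heteroskedastic data.
# 'fit' is a stand-in; substitute your own model (e.g. model240).
set.seed(1)
x <- runif(200, 1, 3)
y <- 1 - 0.5 * x + rnorm(200, sd = 0.1 * x)  # error sd grows with x
fit <- lm(y ~ x)
r <- residuals(fit)

# Breusch-Pagan test by hand: regress the squared residuals on x;
# under homoskedasticity n * R^2 is approximately chi-squared(1).
aux <- lm(I(r^2) ~ x)
bp_stat <- length(r) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(bp_p)  # small p-value -> evidence of heteroskedasticity

# Sample skewness of the residuals (about 0 for symmetric errors)
skew <- mean((r - mean(r))^3) / sd(r)^3
print(skew)

# Shapiro-Wilk test of residual normality
print(shapiro.test(r))
```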

Are you looking at logdiagnostics.pdf or one of the earlier plots?

@noetsi, Where do you see any heteroscedasticity?

I am not sure how to think about the independent variables.

Originally Posted by SiBorg
the simplest model where the anterior chamber depth change is most correlated with the starting depth
So you start by measuring A. Then some time later you measure B. Then you take the difference (B-A) and explain it by the original measure A.

In this regression model:

(B-A) = a +b*A + error

(where a and b are estimated parameters.)

Is it strange that the difference (B-A) is to a large extent explained by A? The first estimated coefficient was close to –0.5. That’s what you could expect from two completely independent random variables.

I am not saying that the model is wrong. But is it relevant?

What kind of models do they usually use in the published literature?


Greta, as always, you make a very interesting point.

So, I thought I would test it by simulation, and I came up with a very interesting result.

So, we have my starting ACD depths which have mean 2.421467 and sd 0.2439209

And my follow-up ACD depths which have mean 2.2628 and sd 0.2278464

So, on average, follow-up ACD depths are shallower than baseline ones.

Now, let's assume that these are representative samples from two normal distributions - (1) the baseline depths and (2) the follow up depths

If we do this, we can generate a vector of values that are a random sample from distribution 1 and a second vector that are a random sample from distribution 2. We can then perform a linear regression of the change between value 1 and value 2 on value 1 as the predictor. However, note that value 1 does not tell you anything about value 2, because they are independent random samples taken from 2 normal distributions, the only difference being that values from the second distribution tend to be smaller than those from the first.

My code for this is as follows:

Code:
> mean(ACDdata$Racd_screen_median)
[1] 2.421467
> sd(ACDdata$Racd_screen_median)
[1] 0.2439209
> ACD1<-rnorm(200,mean(ACDdata$Racd_screen_median),sd(ACDdata$Racd_screen_median))

> mean(ACDdata$R_Acd_Median)
[1] 2.2628
> sd(ACDdata$R_Acd_Median)
[1] 0.2278464
> ACD2<-rnorm(200,mean(ACDdata$R_Acd_Median),sd(ACDdata$R_Acd_Median))

t.test(ACDdata$Racd_screen_median,ACDdata$R_Acd_Median)

Welch Two Sample t-test

data:  ACDdata$Racd_screen_median and ACDdata$R_Acd_Median
t = 7.1304, df = 445.934, p-value = 4.064e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1149344 0.2023989
sample estimates:
mean of x mean of y
2.421467  2.262800

t.test(ACD1,ACD2)

Welch Two Sample t-test

data:  ACD1 and ACD2
t = 5.08, df = 394.406, p-value = 5.835e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.07312061 0.16544830
sample estimates:
mean of x mean of y
2.393669  2.274385

> ACDtest<-data.frame(ACD1,ACD2)
> ACDtest$ACDdiff<-ACDtest$ACD2-ACDtest$ACD1
> model.test<-lm(ACDdiff~ACD1,data=ACDtest)
And, lo and behold, when we look at this model, starting depth significantly predicts shallowing:

Code:
summary(model.test)

Call:
lm(formula = ACDdiff ~ ACD1, data = ACDtest)

Residuals:
Min       1Q   Median       3Q      Max
-0.79016 -0.16329  0.01346  0.17372  0.55780

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.36231    0.15526   15.21   <2e-16 ***
ACD1        -1.03673    0.06452  -16.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2237 on 198 degrees of freedom
Multiple R-squared: 0.5659,	Adjusted R-squared: 0.5637
F-statistic: 258.2 on 1 and 198 DF,  p-value: < 2.2e-16
And when we plot this it looks quite convincing (see crPlot (test plot) and diagnostic plot), even though there is no relationship between the values of ACD1 and ACD2 other than that they were sampled from normal distributions, one of which happened to have a smaller mean than the other.

Therefore, you cannot regress on something that you have used to create the dependent variable.

Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper, because 2 other similar models with the same flaw have been published in recent times...

Are the conclusions from my simulation correct? If so, I may go as far as to say that Greta is a Genius.

Originally Posted by GretaGarbo
@noetsi, Where do you se any heteroscedasticity?
To me, in the residuals versus fitted values graph, it looks like the spreads are not similar across the levels of the X value, which is how heteroscedasticity is captured. Admittedly the difference in the spread is not great.

Originally Posted by noetsi
To me, in the residuals versus fitted values graph, it looks like the spreads are not similar across the levels of the X value, which is how heteroscedasticity is captured. Admittedly the difference in the spread is not great.
Yes but which plot? He fit quite a few models.

I was referring to the logdiagnostics pdf from the third post. It was the only one I saw

You see heteroskedasticity in that plot? I don't. I mean, the left side of the plot doesn't seem to have as large a range of values, but it also has far fewer values, which can influence our perception of the spread of the data. It looks alright to me. And even if it were present, it's not strong enough to care too much about, in my opinion.

As I said, it is close. To me it seemed there was enough difference to comment on. But there is no solid rule that I am aware of (when the differences are not extreme) for how far the range has to change before there is serious heteroskedasticity. It comes down to eyeballing the data and making a judgement call.

Another way of saying you could well be right. The first time I looked at it, the differences seemed further apart

Originally Posted by SiBorg
Therefore, you cannot regress on something that you have used to create the dependant variable.
Well you can - you just shouldn't be surprised if it comes out significant but ... well we know it's related because we used it to create the variable.
Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper because there have recently been 2 other similar models with the same flaw published in recent times...
Well it's easy enough to show if we assume X and Y are independent and normally distributed,

so X ~ N(μ_X, σ_X²)

and Y ~ N(μ_Y, σ_Y²).

So if we let Diff = Y − X and fit the model Diff ~ X, the slope is Cov(Diff, X)/Var(X) = −σ_X²/σ_X² = −1 and the intercept is E[Diff] − (−1)·E[X] = μ_Y, so the regression should come out pretty close to Diff = μ_Y − X.
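A quick numeric check (the means and sds below are just stand-ins close to the ones quoted earlier in the thread): for independent X and Y, the slope of (Y − X) ~ X comes out near −1 and the intercept near mean(Y).

```r
# Numeric check: regress Y - X on X for large independent normal samples.
set.seed(42)
n <- 1e5
X <- rnorm(n, 2.42, 0.24)  # stand-in for baseline ACD
Y <- rnorm(n, 2.26, 0.23)  # stand-in for follow-up ACD, independent of X
fit <- lm(I(Y - X) ~ X)
print(coef(fit))  # slope near -1, intercept near mean(Y)
```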


well we know it's related because we used it to create the variable.
It's related, but, in my simulation, the key is that you can't use it to predict future shallowing, can you? Because there was no relationship between the two measurements in my simulation; they were totally random.