Dear All
I'm struggling with a linear model.
My dependent variable is the change in central anterior chamber depth of the eye with time. Measurement 1 is at time 0, measurement 2 is 6 years later, and I am regressing the change on predictor variables.
My dependent variable is not normally distributed (see histogram).
I have developed 5 hierarchical models, each adding progressively less significant terms.
All of the models show heteroskedasticity and non-normal errors.
I want to report associations but don't want to get things wrong, so any help would be appreciated.
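(For context on the heteroskedasticity question, here is a minimal sketch, on simulated data rather than the real ACDdata, of how it can be tested formally and how inference can be protected with robust standard errors. It assumes the lmtest and sandwich packages are installed; all variable names below are made up for illustration.)

```r
# Sketch on simulated data, not the real ACDdata: a formal test for
# heteroskedasticity and heteroskedasticity-robust standard errors.
# Assumes the 'lmtest' and 'sandwich' packages are installed.
library(lmtest)
library(sandwich)

set.seed(1)
x <- runif(200, 2, 3)                          # stand-in for a baseline depth
y <- 1.3 - 0.6 * x + rnorm(200, sd = 0.05 + 0.3 * (x - 2))  # spread grows with x
fit <- lm(y ~ x)

bptest(fit)                                    # Breusch-Pagan test: small p suggests heteroskedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))  # same coefficients, robust SEs
```

The robust standard errors do not change the coefficient estimates, only the uncertainty attached to them, so they are a cheap check on whether the heteroskedasticity is actually distorting the p-values.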
So, model '210' is the simplest model, in which the anterior chamber depth change is most correlated with the starting depth (I think this association is too strong to be explained by 'regression to the mean', but opinions would be appreciated).
This model corresponds to crPlots1 and diagnostic plot 1.

Code:
> model210 <- lm(R_ACD_change ~ Racd_screen_median, ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model210)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.39377 -0.12872 -0.02327  0.07721  0.74808 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.2766     0.1787   7.145 2.13e-11 ***
Racd_screen_median  -0.5908     0.0737  -8.015 1.31e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1868 on 181 degrees of freedom
Multiple R-squared:  0.262,	Adjusted R-squared:  0.2579 
F-statistic: 64.25 on 1 and 181 DF,  p-value: 1.312e-13
The next model adds the angle width as a predictor
This corresponds to crPlots2 and diagnostic2.

Code:
> model220 <- lm(R_ACD_change ~ Racd_screen_median + R_median_shaff, ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model220)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.37483 -0.12239 -0.02013  0.08611  0.61278 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.21075    0.16842   7.189 1.71e-11 ***
Racd_screen_median -0.64229    0.07000  -9.175  < 2e-16 ***
R_median_shaff      0.09470    0.01852   5.115 8.03e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1755 on 179 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.3557,	Adjusted R-squared:  0.3485 
F-statistic: 49.4 on 2 and 179 DF,  p-value: < 2.2e-16
The next model adds the follow-up lens thickness/eye length ratio. This was only available at follow-up so is not a 'predictor' but is associated with the change.
This corresponds to crPlots3 and diagnostic3.

Code:
> model230 <- lm(R_ACD_change ~ Racd_screen_median + R_median_shaff + R_Lens_AL, ACDdata[(ACDdata$Pi_Bl1==0 & ACDdata$Racd_screen_median<3),])
> summary(model230)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_median_shaff + R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35673 -0.09986 -0.01836  0.09092  0.54430 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.34071    0.25461   9.193  < 2e-16 ***
Racd_screen_median -0.66705    0.06598 -10.110  < 2e-16 ***
R_median_shaff      0.07861    0.01765   4.454 1.56e-05 ***
R_Lens_AL          -4.87701    0.86730  -5.623 7.91e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.16 on 164 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared:  0.4631,	Adjusted R-squared:  0.4533 
F-statistic: 47.16 on 3 and 164 DF,  p-value: < 2.2e-16
The final model has the change in angle width as well
This corresponds to crPlots4 and diagnostic4.

Code:
> summary(model240)

Call:
lm(formula = R_ACD_change ~ Racd_screen_median + R_Shaffer_Change + R_median_shaff + R_Lens_AL, data = ACDdata[(ACDdata$Pi_Bl1 == 0 & ACDdata$Racd_screen_median < 3), ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32231 -0.10900 -0.01588  0.08898  0.48960 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.21582    0.24535   9.031 5.01e-16 ***
Racd_screen_median -0.68331    0.06315 -10.821  < 2e-16 ***
R_Shaffer_Change    0.04892    0.01444   3.387 0.000889 ***
R_median_shaff      0.09389    0.01789   5.247 4.81e-07 ***
R_Lens_AL          -4.04595    0.84703  -4.777 3.98e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1519 on 161 degrees of freedom
  (17 observations deleted due to missingness)
Multiple R-squared:  0.5038,	Adjusted R-squared:  0.4915 
F-statistic: 40.87 on 4 and 161 DF,  p-value: < 2.2e-16
I'm mainly interested in reporting the association with starting chamber depth (which seems strong) and with angle width (which is weak but interesting). I'm also quite interested in reporting the association with follow-up lens thickness, as lens thickness changes are thought to contribute to this.
Does this seem reasonable, or do I need to consider robust regression instead?
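(For reference, robust regression is cheap to try; a minimal sketch on simulated data with a few injected outliers, using MASS::rlm, which ships with R. The variable names here are made up, not from ACDdata.)

```r
# Sketch of robust regression on simulated data (MASS ships with R);
# rlm() downweights large residuals instead of letting them drive the fit.
library(MASS)

set.seed(2)
x <- rnorm(180, mean = 2.4, sd = 0.24)       # stand-in for baseline depth
y <- 1.3 - 0.6 * x + rnorm(180, sd = 0.15)
y[1:5] <- y[1:5] + 1                         # inject a few gross outliers

fit_ols <- lm(y ~ x)
fit_rob <- rlm(y ~ x)                        # Huber M-estimation by default
round(fit_rob$w[1:5], 2)                     # the outliers receive weights well below 1
```

Comparing coef(fit_ols) with coef(fit_rob) then shows how much the outliers were moving the ordinary least squares fit.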
Very very many thanks for looking at this.
some more attachments....
Here is the new QQ plot for the full model, having taken the difference of the logs as the dependent variable.
You clearly do have heteroskedasticity, as the range of the residuals changes with the level of the fitted values (I assume it changes with the level of X as well, although I would run separate diagnostics for that). And your QQ plot suggests you have significant outliers in the tail, or skewness, or both. You might run a skewness test to see, and try to identify the outliers.
It is possible that if you deal with the outliers the problem may no longer be significant. But more likely you need to transform your data to make it homoskedastic. How is your dependent variable measured? Is it an interval-scale variable?
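(As an illustration of one such transformation: for a strictly positive measurement, a common option is to model the log ratio of follow-up to baseline, log(B/A) = log(B) - log(A), which often tames multiplicative error. A sketch on simulated data; A and B here are made-up stand-ins, not ACDdata.)

```r
# Sketch: model the log ratio of follow-up (B) to baseline (A) instead of
# the raw difference; multiplicative noise becomes additive on the log scale.
set.seed(3)
A <- rnorm(180, mean = 2.42, sd = 0.24)            # stand-in baseline depths
B <- A * exp(rnorm(180, mean = -0.03, sd = 0.05))  # follow-up with multiplicative noise
fit_log <- lm(log(B / A) ~ A)                      # same form as the models above
summary(fit_log)$coefficients
```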
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Are you looking at logdiagnostics.pdf or one of the earlier plots?
I don't have emotions and sometimes that makes me very sad.
@noetsi, where do you see any heteroscedasticity?
I am not sure how to think about the independent variables.
So you start by measuring A. Then some time later you measure B.
Then you take that difference (B-A) and explain it by the original measure A.
In this regression model:
(B-A) = a + b*A + error
(where a and b are estimated parameters.)
Is it strange that the difference (B-A) is to a large extent explained by A? The first estimated coefficient was close to -0.5. That's what you could expect from two completely independent random variables.
I am not saying that the model is wrong. But is it relevant?
What kind of models do they usually use in the published literature?
SiBorg (09-28-2012)
Greta, as always, you make a very interesting point.
So, I thought I would test it by simulation, and I came up with a very interesting result.
So, we have my starting ACD depths which have mean 2.421467 and sd 0.2439209
And my follow-up ACD depths which have mean 2.2628 and sd 0.2278464
So, on average, follow-up ACD depths are shallower than baseline ones.
Now, let's assume that these are representative samples from two normal distributions: (1) the baseline depths and (2) the follow-up depths.
If we do this, we can generate a vector of values that are a random sample from distribution 1 and a second vector that are a random sample from distribution 2. We can then regress the change between value 1 and value 2 on value 1. However, note that value 1 tells you nothing about value 2, because they are independent random samples from two normal distributions; the only difference is that values from the second distribution tend to be smaller than those from the first.
My code for this is as follows:
And, lo and behold, when we look at this model, starting depth significantly predicts shallowing:

Code:
> mean(ACDdata$Racd_screen_median)
[1] 2.421467
> sd(ACDdata$Racd_screen_median)
[1] 0.2439209
> ACD1 <- rnorm(200, mean(ACDdata$Racd_screen_median), sd(ACDdata$Racd_screen_median))
> mean(ACDdata$R_Acd_Median)
[1] 2.2628
> sd(ACDdata$R_Acd_Median)
[1] 0.2278464
> ACD2 <- rnorm(200, mean(ACDdata$R_Acd_Median), sd(ACDdata$R_Acd_Median))
> t.test(ACDdata$Racd_screen_median, ACDdata$R_Acd_Median)

	Welch Two Sample t-test

data:  ACDdata$Racd_screen_median and ACDdata$R_Acd_Median
t = 7.1304, df = 445.934, p-value = 4.064e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1149344 0.2023989
sample estimates:
mean of x mean of y 
 2.421467  2.262800 

> t.test(ACD1, ACD2)

	Welch Two Sample t-test

data:  ACD1 and ACD2
t = 5.08, df = 394.406, p-value = 5.835e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.07312061 0.16544830
sample estimates:
mean of x mean of y 
 2.393669  2.274385 

> ACDtest <- data.frame(ACD1, ACD2)
> ACDtest$ACDdiff <- ACDtest$ACD2 - ACDtest$ACD1
> model.test <- lm(ACDdiff ~ ACD1, data = ACDtest)
And when we plot this it looks quite convincing (see crPlot (test plot) and diagnostic plot), even though there is no relationship between the values of ACD1 and ACD2, other than that they were sampled from normal distributions where one happened to have a smaller mean than the other.

Code:
> summary(model.test)

Call:
lm(formula = ACDdiff ~ ACD1, data = ACDtest)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79016 -0.16329  0.01346  0.17372  0.55780 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.36231    0.15526   15.21   <2e-16 ***
ACD1        -1.03673    0.06452  -16.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2237 on 198 degrees of freedom
Multiple R-squared:  0.5659,	Adjusted R-squared:  0.5637 
F-statistic: 258.2 on 1 and 198 DF,  p-value: < 2.2e-16
Therefore, you cannot regress on something that you have used to create the dependent variable.
Now, if someone (Dason?) can prove this mathematically, I have a great letter/paper, because two other similar models with the same flaw have been published recently...
Are the conclusions from my simulation correct? If so, I may go as far as to say that Greta is a Genius.
I was referring to the logdiagnostics pdf from the third post. It was the only one I saw.
You see heteroskedasticity in that plot? I don't. I mean, the left side of the plot doesn't seem to have as large a range of values, but it also has a lot fewer values, which can influence our perception of the spread of the data. It looks alright to me. And even if it were present, it's not strong enough to care too much about, in my opinion.
As I said, it is close. To me there seemed to be enough difference to comment on. But there is no solid rule that I am aware of (when the differences are not extreme) for how much the range has to change before the heteroskedasticity becomes serious. It comes down to eyeballing the data and making a judgement call.
That is another way of saying you could well be right. The first time I looked at it, the differences seemed further apart.
Well you can - you just shouldn't be surprised if it comes out significant, but... well, we know it's related because we used it to create the variable.
Well it's easy enough to show if we assume X and Y are independent and normally distributed.

so

Cov(Y - X, X) = Cov(Y, X) - Var(X) = -Var(X)

and

slope = Cov(Y - X, X) / Var(X) = -1
intercept = E[Y - X] - slope * E[X] = E[Y]

So if we let Diff = Y - X and fit the model Diff ~ X, the regression should come out pretty close to

Diff = E[Y] - X
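(A quick numerical check of that claim in base R, with simulated means roughly matching the data in the thread:)

```r
# Numerical check: with X and Y independent, regressing Diff = Y - X on X
# recovers slope -1 and intercept E[Y], as the covariance algebra predicts.
set.seed(42)
X <- rnorm(1e5, mean = 2.42, sd = 0.24)
Y <- rnorm(1e5, mean = 2.26, sd = 0.23)
fit <- lm(I(Y - X) ~ X)
coef(fit)   # intercept near 2.26, slope near -1
```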
It's related, but in my simulation the key is that you can't use it to predict future shallowing, can you? There was no relationship between the two measurements in my simulation; they were totally random.