# Unobserved heterogeneity

#### noetsi

##### Fortran must die
This is potentially a significant problem for Cox and logistic regression. I have never found a test for whether it exists. I doubt there is one, but if anyone has seen a suggestion for how to detect it, I would appreciate comments on it.

#### kiton

##### New Member
> This is potentially a significant problem for Cox and logistic regression. I have never found a test for whether it exists. I doubt there is one, but if anyone has seen a suggestion for how to detect it, I would appreciate comments on it.

Don't kick me in the butt if I am wrong, but let me try :) There is a formal test for heterogeneity: the Hausman chi-square test. Not the one for model comparison, but the one for testing whether there is correlation between the predictors and the residuals. I can tell you more once I get back home and look into my notes.

#### noetsi

##### Fortran must die
thanks. I will look that up in the meantime.

#### spunky

##### Smelly poop man with doo doo pants.
> There is a formal test for heterogeneity: the Hausman chi-square test. Not the one for model comparison, but the one for testing whether there is correlation between the predictors and the residuals.

the Hausman test is not a test for unobserved heterogeneity. the Hausman test is a test of ENdogeneity (as opposed to EXogeneity, which i'm guessing is what you were going for). ENdogeneity can happen for many reasons (measurement error, omitted variables, etc.) but unobserved heterogeneity is not one of them... unless the constituting models of the heterogeneity are endogenous themselves (then it just becomes a big fat mess).

noetsi, unobserved heterogeneity is not only an issue of Cox regression. the models that we usually learn all assume homogeneity of the samples. i'll provide an R example for you using basic OLS regression. it has some small simulation stuff there so it can keep you interested

anyhoo, say noetsi wants to do an OLS multiple regression but he doesn't know his sample is actually a mix of two samples, each coming from a population with its own regression equation.

the first population is parameterized by the multiple regression:

y1 = 5 + 5x11 + 5x12 + e (where e has a standard deviation of 5, i.e. a variance of 25)

the second population is parameterized by the equation:

y2 = -5 - 5x21 - 5x22 + e (where e also has a standard deviation of 5)

so, if you sample say 1000 people from each group and fit each regression separately:

Code:
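x11 <- rnorm(1000)   # predictors for population 1
x12 <- rnorm(1000)

x21 <- rnorm(1000)   # predictors for population 2
x22 <- rnorm(1000)

y1 <- 5 + 5*x11 + 5*x12 + rnorm(1000, 0, 5)    # third argument of rnorm() is the sd
y2 <- -5 - 5*x21 - 5*x22 + rnorm(1000, 0, 5)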

summary(lm(y1 ~ x11 + x12))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.4383     0.1663   32.69   <2e-16 ***
x11           5.1778     0.1626   31.84   <2e-16 ***
x12           5.0767     0.1615   31.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.257 on 997 degrees of freedom
Multiple R-squared:  0.6576,    Adjusted R-squared:  0.6569
F-statistic: 957.2 on 2 and 997 DF,  p-value: < 2.2e-16

summary(lm(y2 ~ x21 + x22))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5.1734     0.1571  -32.93   <2e-16 ***
x21          -4.8211     0.1578  -30.55   <2e-16 ***
x22          -4.7368     0.1639  -28.90   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.966 on 997 degrees of freedom
Multiple R-squared:  0.6397,    Adjusted R-squared:  0.6389
F-statistic: 884.9 on 2 and 997 DF,  p-value: < 2.2e-16
they are pretty nifty regressions by themselves. all the regression coefficients are significant, R-squared is over 60% in both cases. cool stuff.

but this is NOT what noetsi sees. noetsi doesn't know that there are actually two distinct samples from two distinct populations with two distinct regression equations. what noetsi sees is this:

Code:
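Y <- c(y1, y2)      # 2000 outcomes, group 1 first and group 2 second
X1 <- c(x11, x12)   # stacks x11 with x12 (both from group 1); c(x11, x21) would keep predictor 1 in one column
X2 <- c(x21, x22)   # stacks x21 with x22 (both from group 2); c(x12, x22) would keep predictor 2 in one column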

summary(lm(Y ~ X1 + X2))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.05887    0.21349   0.276    0.783
X1           2.47944    0.20782  11.931   <2e-16 ***
X2          -2.07801    0.21849  -9.511   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.545 on 1997 degrees of freedom
Multiple R-squared:  0.1032,    Adjusted R-squared:  0.1023
F-statistic: 114.9 on 2 and 1997 DF,  p-value: < 2.2e-16
so Y, X1 and X2 each have 2000 sampling units (1000 coming from Regression #1 and 1000 coming from Regression #2). as you can see, this combined-regression data set is not as nifty as each regression by itself. the R-squared is a meager 10% and the regression coefficients are smaller in magnitude in both cases, with higher standard errors. and that makes sense. you have two samples that are behaving in entirely different ways... one has the regression line going up and the other one has the regression line going down.
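a small sketch of the same pooled fit when each predictor keeps its own column, i.e. `X1 <- c(x11, x21)` and `X2 <- c(x12, x22)`. the seed is arbitrary and not from the runs above, so the numbers will not match the printed output. with this stacking the +5 and -5 slopes should roughly cancel, so expect pooled coefficients and an R-squared close to zero.

```r
# sketch: pooled OLS with each predictor kept in its own column
set.seed(123)                        # arbitrary seed, chosen only for reproducibility
n <- 1000
x11 <- rnorm(n); x12 <- rnorm(n)     # group 1 predictors
x21 <- rnorm(n); x22 <- rnorm(n)     # group 2 predictors
y1 <-  5 + 5 * x11 + 5 * x12 + rnorm(n, 0, 5)
y2 <- -5 - 5 * x21 - 5 * x22 + rnorm(n, 0, 5)

Y  <- c(y1, y2)
X1 <- c(x11, x21)                    # first predictor from both groups
X2 <- c(x12, x22)                    # second predictor from both groups

pooled <- lm(Y ~ X1 + X2)
coef(pooled)                         # slopes of opposite sign cancel in the pooled fit
summary(pooled)$r.squared
```

either way the message is the same: the pooled model describes neither group.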

within statistical parlance, we usually refer to "unobserved heterogeneity" as either finite mixtures (in statistics) or latent classes (in psychometrics). what you have is a finite mixture of multivariate normal distributions, each parameterized by its own regression.
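in symbols, a sketch of what such a mixture looks like for the conditional distribution of y given the predictors. the notation (K components, mixing proportions pi_k, coefficients beta_k, error variances sigma_k^2) is introduced here for illustration and is not from the thread:

```latex
f(y \mid x_1, x_2) = \sum_{k=1}^{K} \pi_k \, \phi\!\left(y;\ \beta_{0k} + \beta_{1k} x_1 + \beta_{2k} x_2,\ \sigma_k^2\right),
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1
```

here phi(y; mu, sigma^2) is the normal density. in the simulated example K = 2, each group contributes half the sample, the coefficients are (5, 5, 5) and (-5, -5, -5), and both error standard deviations are 5.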

there are ways in which you can test for this. the only way i know is to actually fit a mixture model and see whether the fit statistics it gives you favor more than one component (usually information criteria of various sorts, but also more advanced things like a bootstrapped likelihood ratio test).

i can demo this example with a very nifty R package called flexmix and do something like this

Code:
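library(flexmix)

mod1 <- flexmix(Y ~ X1 + X2, k=1)   # one component: the usual homogeneous regression
mod2 <- flexmix(Y ~ X1 + X2, k=2)   # two latent classes, each with its own regression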

summary(mod1)
prior size post>0 ratio
Comp.1     1 2000   2000     1

'log Lik.' -7348.45 (df=4)
AIC: 14704.9   BIC: 14727.3

summary(mod2)
prior size post>0 ratio
Comp.1 0.528 1041   1995 0.522
Comp.2 0.472  959   1978 0.485

'log Lik.' -7244.126 (df=9)
AIC: 14506.25   BIC: 14556.66
first of all, see how well the EM algorithm that runs under flexmix recognized that there were two samples. it allocates 1041 to sample 1 and 959 to sample 2, so the component sizes are off by fewer than 100 sampling units! good stuff! second, see the drop in BIC from 14727 for a model that assumes homogeneity (only 1 class) to 14556 for a model that assumes 2 latent classes. if you were to fit a model that assumes 3, 4 or more latent classes you would expect to see the BIC go up again, in which case you know you should choose the model that assumes 2 latent classes.
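for anyone curious what an EM algorithm does with this kind of data, here is a minimal base-R sketch of EM for a two-component mixture of regressions. it is not flexmix's implementation, just an illustration under stated assumptions: the data are re-simulated with an arbitrary seed, each predictor keeps its own column (`x1`, `x2`), the helper `wls()` is made up for this sketch, the starting split on the median of Y is a crude choice, and the loop runs a fixed 200 iterations with no convergence check.

```r
# sketch: EM for a two-component mixture of linear regressions (base R only)
set.seed(1)                          # arbitrary seed
n  <- 1000
x1 <- rnorm(2 * n); x2 <- rnorm(2 * n)
g  <- rep(c(1, -1), each = n)        # true group, unknown to the analyst
Y  <- g * (5 + 5 * x1 + 5 * x2) + rnorm(2 * n, 0, 5)
X  <- cbind(1, x1, x2)               # design matrix with intercept

# hypothetical helper: weighted least squares with weights w
wls <- function(w) {
  b  <- solve(crossprod(X, w * X), crossprod(X, w * Y))
  mu <- drop(X %*% b)
  list(b = drop(b), mu = mu, s = sqrt(sum(w * (Y - mu)^2) / sum(w)))
}

r <- as.numeric(Y > median(Y))       # crude starting responsibilities for component 1
for (iter in 1:200) {
  # M-step: refit each regression using the current responsibilities as weights
  c1 <- wls(r); c2 <- wls(1 - r); p <- mean(r)
  # E-step: posterior probability that each unit belongs to component 1
  d1 <- p * dnorm(Y, c1$mu, c1$s)
  d2 <- (1 - p) * dnorm(Y, c2$mu, c2$s)
  r  <- d1 / (d1 + d2)
}

loglik2 <- sum(log(d1 + d2))                 # mixture log-likelihood
bic2 <- -2 * loglik2 + 9 * log(2 * n)        # 2 x (3 coefs + 1 sd) + 1 mixing proportion
bic1 <- BIC(lm(Y ~ x1 + x2))                 # homogeneous model, 4 parameters

rbind(comp1 = c1$b, comp2 = c2$b)
c(bic1 = bic1, bic2 = bic2)
```

the parameter counts (4 and 9) match the `df` values in the flexmix summaries above.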

if you play around a little bit with these fit statistics you should be able to get evidence from your data to decide whether or not unobserved heterogeneity is present. and, if it is present, you at least can do something about it and model it.
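one way to do that playing around is `stepFlexmix()`, which fits a range of `k` with several random restarts, together with `getModel()`, which picks the fit with the best criterion. a hedged sketch: the data are re-simulated with an arbitrary seed and with each predictor in its own column, `nrep = 3` is an arbitrary number of restarts, and the block only runs if flexmix is installed.

```r
# sketch: compare 1 to 4 latent classes by BIC (needs the flexmix package)
set.seed(2024)                        # arbitrary seed
n <- 1000
dat <- data.frame(X1 = rnorm(2 * n), X2 = rnorm(2 * n))
g <- rep(c(1, -1), each = n)          # hidden group membership
dat$Y <- g * (5 + 5 * dat$X1 + 5 * dat$X2) + rnorm(2 * n, 0, 5)

best_k <- NA                          # stays NA if flexmix is not available
if (requireNamespace("flexmix", quietly = TRUE)) {
  library(flexmix)
  fits <- stepFlexmix(Y ~ X1 + X2, data = dat, k = 1:4,
                      nrep = 3, verbose = FALSE)
  print(fits)                         # one row per k with logLik, AIC, BIC, ICL
  best <- getModel(fits, which = "BIC")
  best_k <- best@k                    # number of components in the BIC-preferred fit
}
best_k
```

because EM starts from random assignments, results can vary between runs, so it is worth checking that the preferred `k` is stable across seeds.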

i'm sure SAS can do this as well. but as you know, SAS is in cahoots with the evil, evil Cauchy Distribution and i wouldn't even touch it with a 10ft pole!!!! MWAHAHAHAHAHAHAHAHAHAHAHAAAAA!!!!

#### hlsmith

##### Omega Contributor
nice example spunky. This reminded me of hierarchical analyses and being ignorant of higher levels.

statistical parlance sounds like Canadian speak.

#### noetsi

##### Fortran must die
To return to your point above, spunky, unobserved heterogeneity is tied primarily to variables that influence the model but are not in the model. I am not sure how this relates to your discussion of two populations, which seems like a different definition of heterogeneity. I would think looking for outliers would be one way to capture that. But I will look up your mixture model further.

AIM actually controls R through the old S+ language. Remember the language that R was originally derived from was developed by the old Bell Labs as it was dying. Not hard to infiltrate an organization or a software that was dying...

#### spunky

##### Smelly poop man with doo doo pants.
> To return to your point above, spunky, unobserved heterogeneity is tied primarily to variables that influence the model but are not in the model. I am not sure how this relates to your discussion of two populations, which seems like a different definition of heterogeneity. I would think looking for outliers would be one way to capture that. But I will look up your mixture model further.

i really think that both you and kiton are referring to endogeneity. look at the article on wikipedia. the omitted variable problem describes what you're referring to but that's a cause of endogeneity, not unobserved heterogeneity.

still, BOTH (endogeneity and unobserved heterogeneity) mess up your analysis.
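to make the omitted variable version concrete, here is a small base-R sketch. everything in it is an assumption for illustration (the names `x`, `z`, `u`, the coefficients 2 and 3, the seed): `z` affects `y` and is correlated with `x`, so leaving `z` out pushes it into the error term and the error becomes correlated with `x`. with these choices the bias in the slope of `x` is 3 * cov(x, z) / var(x) = 3 * 1 / 2 = 1.5, so the short model should land near 3.5 rather than 2.

```r
# sketch: endogeneity from an omitted variable (base R only)
set.seed(7)                    # arbitrary seed
n <- 5000
z <- rnorm(n)                  # confounder that will be left out
x <- z + rnorm(n)              # predictor correlated with z
e <- rnorm(n)
y <- 1 + 2 * x + 3 * z + e     # true slope of x is 2

short <- lm(y ~ x)             # z omitted: it ends up in the error term
long  <- lm(y ~ x + z)         # z included

u <- 3 * z + e                 # error term of the short model
cor(x, u)                      # clearly nonzero: x is endogenous in the short model
coef(short)["x"]               # biased upward
coef(long)["x"]                # close to the true slope
```

note the contrast with the mixture example: here there is a single regression equation for everyone, and the trouble comes from the predictor being correlated with the error.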

#### noetsi

##### Fortran must die
I think this is central to what Allison is talking about, although the omitted variable in his discussion would not have to be correlated with a variable in the model; that would just make it worse.

From the wiki link spunky gave:

"In this case, the endogeneity comes from an uncontrolled confounding variable. A variable is both correlated with an independent variable in the model and with the error term."

#### noetsi

##### Fortran must die
This may provide a better basis for what Allison is talking about. Or at least clarify my lack of understanding...

"An implicit assumption of all hazard models that we have considered so far is that if two individuals have identical values on the covariates, they also have identical hazard functions.....Obviously, this is an unrealistic assumption. Individuals and their environments differ in so many respects that no set of measured covariates can possibly capture all variation among them. In an ordinary linear regression model, this residual or unobserved heterogeneity is explicitly represented by a random disturbance term....But in a Cox regression model, for example, there is no disturbance term:"

The disturbance term, of course, is where variables left out of the model, exogenous to it, would show up. I am uncertain whether this is what spunky is bringing up or not.

#### spunky

##### Smelly poop man with doo doo pants.
my take would be that he's just throwing around the term "unobserved heterogeneity" not in a statistical sense but more in a literary sense of "oh, there are so many forces in nature outside of our control that we cannot possibly account for them all". even the way in which he's expressing himself speaks of endogeneity through the omitted variable/3rd variable problem although he calls it "unobserved heterogeneity".

spunky concludes that he's not speaking of "unobserved heterogeneity" in a statistical sense but more because it sounds easier to understand than "endogeneity". maybe that's why you're having trouble figuring out what it does in Cox regression... because you're looking for "unobserved heterogeneity" and end up looking at things like i posted (or the article you linked) when you should be looking for "endogeneity" and the literature on instrumental variables, the Hausman test, errors-in-variables models, etc etc

#### noetsi

##### Fortran must die
It's possible, although he teaches statistics at the University of Pennsylvania and has sixty published articles, which is pretty impressive to moi. I am pretty sure he is talking about the effects of variables left out of the model, at least primarily.

Assuming that is the case, and ignoring wording, is there any way to detect this type of issue? Allison discusses this at length but in the narrow context of events that can repeat.

#### spunky

##### Smelly poop man with doo doo pants.
which tests does Allison suggest? or what strategy? or what's the name of the book and the page number where he talks about this? now you've got me curious and i wanna verify first hand what he's saying.

#### kiton

##### New Member
> the Hausman test is not a test for unobserved heterogeneity. the Hausman test is a test of ENdogeneity (as opposed to EXogeneity, which i'm guessing is what you were going for). ENdogeneity can happen for many reasons (measurement error, omitted variables, etc.) but unobserved heterogeneity is not one of them... unless the constituting models of the heterogeneity are endogenous themselves (then it just becomes a big fat mess).

You are so right. I got confused with the terms, I am sorry.

#### spunky

##### Smelly poop man with doo doo pants.
> You are so right. I got confused with the terms, I am sorry.

that's OK. this is a place where we all learn from each other so it's important for all to talk about stuff together. besides, endogeneity is an important problem in itself that rarely gets addressed outside the fields of econometrics/information systems/business... which i believe is the area you specialize in. so your perspective is important.

#### noetsi

##### Fortran must die
Spunky, remember this is done in the context of the hazard function, specifically when the hazard function declines artificially over time due to unobserved heterogeneity. That said, on p. 259 Allison notes: "What can be done? As we'll see in the next section, when events are repeatable, it is quite feasible to separate the true hazard function from unobserved heterogeneity. But when events are not repeatable ... the options are limited." On p. 260 he notes that there have been numerous attempts to separate the hazard function from unobserved heterogeneity by formulating models that incorporate both [essentially by adding a random error term]. The problem with this is that it is unclear what distribution that error term should take, as well as its specific dependence on time. I feel this reflects the fact that an error term helps deal with this issue, although obviously the problem also occurs in forms of regression that do have error terms when the omitted variables are correlated with predictors in the model.
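A minimal simulation sketch of that artificial decline, in R like the example earlier in the thread. The setup is an assumption for illustration, not taken from Allison: two hidden groups, each with a constant hazard (0.5 and 2), no covariates and no censoring. Within each group the hazard never changes, but the high-hazard group fails early, so the survivors are increasingly low-hazard people and the hazard estimated from the pooled sample should fall over time.

```r
# sketch: constant hazards within hidden groups, declining hazard in the pooled sample
set.seed(42)                               # arbitrary seed
n <- 50000
rate <- rep(c(0.5, 2), each = n / 2)       # hidden heterogeneity: two constant hazards
surv_time <- rexp(n, rate)                 # exponential event times, no censoring

breaks <- seq(0, 2, by = 0.5)
hazard <- sapply(seq_len(length(breaks) - 1), function(i) {
  a <- breaks[i]; b <- breaks[i + 1]
  events   <- sum(surv_time >= a & surv_time < b)     # events in [a, b)
  exposure <- sum(pmax(0, pmin(surv_time, b) - a))    # time at risk in [a, b)
  events / exposure                                   # occurrence/exposure rate
})
names(hazard) <- paste0("[", head(breaks, -1), ",", breaks[-1], ")")
hazard
```

The pooled hazard starts between the two group hazards and drifts toward the lower one, even though nobody's individual hazard changes.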

He discusses some of the problems unobserved heterogeneity causes on p. 260: it influences not only the SEs but also the slopes, which I was not aware of. Pages 260-274 address this issue, but since I only went through this section briefly, it is not clear to me when it is dealing with unobserved heterogeneity and when it is dealing with repeated events, which are not the norm in Cox regression.

Unfortunately I sent the book back yesterday. But it is Survival Analysis Using SAS by Paul Allison, 2nd ed., printed 2010.

#### spunky

##### Smelly poop man with doo doo pants.
> Unfortunately I sent the book back yesterday. But it is Survival Analysis Using SAS by Paul Allison, 2nd ed., printed 2010.

THANKS! my uni has an online version of the book. i'll make sure to check it out and will explore this further