Originally Posted by

**kiton**
There is a formal test for heterogeneity - Hausman Chi-square test. Not the one for model comparison, but the one for testing whether there exists correlation between the predictors and the residuals.

the Hausman test is not a test for unobserved heterogeneity. the Hausman test is a test of **EN**dogeneity (as opposed to **EX**ogeneity, which i'm guessing is what you were going for). **EN**dogeneity can happen for many reasons (measurement error, omitted variables, etc.) but unobserved heterogeneity is not one of them... unless the constituting models of the heterogeneity are endogenous themselves (then it just becomes a big fat mess).

noetsi, unobserved heterogeneity is not only an issue for Cox regression. the models that we usually learn all assume homogeneity of the sample. i'll provide an R example for you using basic OLS regression. it has a bit of simulation in it to keep things interesting

anyhoo, say noetsi wants to do an OLS multiple regression but he doesn't know his sample is actually a mix of samples coming from populations with two different regression equations in the population.

the first population is parameterized by the multiple regression:

y1 = 5 + 5x11 + 5x12 + e (where e has a variance of 5)

the second population is parameterized by the equation:

y2 = -5 - 5x21 - 5x22 + e (where e also has a variance of 5)

so, if you sample say 1000 people from each group and fit each regression separately:
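something along these lines (a sketch of the simulation - the seed and the exact draws are my own choices, not the ones from my original run):

```r
set.seed(123)                        # arbitrary seed
n <- 1000

# population 1: y1 = 5 + 5*x11 + 5*x12 + e, with var(e) = 5
x11 <- rnorm(n); x12 <- rnorm(n)
y1  <- 5 + 5*x11 + 5*x12 + rnorm(n, sd = sqrt(5))

# population 2: y2 = -5 - 5*x21 - 5*x22 + e, with var(e) = 5
x21 <- rnorm(n); x22 <- rnorm(n)
y2  <- -5 - 5*x21 - 5*x22 + rnorm(n, sd = sqrt(5))

# fit each regression on its own sample
summary(lm(y1 ~ x11 + x12))
summary(lm(y2 ~ x21 + x22))
```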

they are pretty nifty regressions by themselves. all the regression coefficients are significant, R-squared is over 60% in both cases. cool stuff.

but this is NOT what noetsi sees. noetsi doesn't know that there are actually two distinct samples from two distinct populations with two distinct regression equations. what noetsi sees is this:
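one big pooled data set with no label saying who came from which population. a self-contained sketch (again, seed and variable names are mine):

```r
set.seed(123)                        # arbitrary seed
n <- 1000

# sample 1: y = 5 + 5*x1 + 5*x2 + e, var(e) = 5
x1a <- rnorm(n); x2a <- rnorm(n)
ya  <- 5 + 5*x1a + 5*x2a + rnorm(n, sd = sqrt(5))

# sample 2: y = -5 - 5*x1 - 5*x2 + e, var(e) = 5
x1b <- rnorm(n); x2b <- rnorm(n)
yb  <- -5 - 5*x1b - 5*x2b + rnorm(n, sd = sqrt(5))

# stack the two samples: 2000 rows, group membership unobserved
dat <- data.frame(y  = c(ya, yb),
                  x1 = c(x1a, x1b),
                  x2 = c(x2a, x2b))

# the regression noetsi actually runs
summary(lm(y ~ x1 + x2, data = dat))   # R-squared collapses, SEs blow up
```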

so Y, X1 and X2 each have 2000 sampling units (1000 coming from Regression #1 and 1000 coming from Regression #2). as you can see, this combined-regression data set is not as nifty as each regression by itself. the R-squared is a meager 10% and the regression coefficients are smaller in both cases, with larger standard errors. and that makes sense. you have two samples that are behaving in entirely different ways... one has the regression line going up and the other one has the regression line going down.

within statistical parlance, we usually refer to "unobserved heterogeneity" as either finite mixtures (in statistics) or latent classes (in psychometrics). what you have is a finite mixture of multivariate normal distributions, each parameterized by its own regression.

there are ways in which you can test for this. the only way i know is to actually fit a mixture model and see whether the fit statistics it gives you (usually information criteria of various sorts, but also more advanced things like a bootstrapped likelihood ratio test) favour the mixture over a single homogeneous model.

i can demo this example with a very nifty R package called flexmix and do something like this
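roughly like this (a sketch using flexmix's standard interface - the seed is arbitrary, so your allocations and BICs will differ a bit from the numbers i quote below):

```r
library(flexmix)

set.seed(123)                        # arbitrary seed
n <- 1000
x1 <- rnorm(2 * n); x2 <- rnorm(2 * n)
grp <- rep(1:2, each = n)            # true (unobserved) group labels
y <- ifelse(grp == 1,
             5 + 5*x1 + 5*x2,
            -5 - 5*x1 - 5*x2) + rnorm(2 * n, sd = sqrt(5))
dat <- data.frame(y, x1, x2)

# k = 1 assumes homogeneity; k = 2 allows two latent classes
m1 <- flexmix(y ~ x1 + x2, data = dat, k = 1)
m2 <- flexmix(y ~ x1 + x2, data = dat, k = 2)

table(clusters(m2))    # how many units the EM algorithm puts in each class
BIC(m1); BIC(m2)       # a lower BIC for m2 points at 2 latent classes
parameters(m2)         # the per-class regression coefficients
```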

first of all, see how well the EM algorithm that runs under flexmix recognized that there were two samples. it allocates 1041 to sample 1 and 959 to sample 2, so it's off by less than 100 sampling units! good stuff! second, see the drop in BIC from 14727 for a model that assumes homogeneity (only 1 class) to 14556 for a model that assumes 2 latent classes. if you were to fit a model that assumes 3, 4 or more latent classes you would see the BIC go up again, in which case you know you should choose the model that assumes 2 latent classes.

if you play around a little bit with these fit statistics you should be able to get evidence from your data to decide whether or not unobserved heterogeneity is present. and, if it is present, you at least can do something about it and model it.

i'm sure SAS can do this as well. but as you know, SAS is in cahoots with the evil, evil Cauchy Distribution and i wouldn't even touch it with a 10ft pole!!!! MWAHAHAHAHAHAHAHAHAHAHAHAAAAA!!!!