Hosmer and Lemeshow Test is significant

#1
Hi all,

Can anyone tell me whether one can still use a model that reports an H&L test with a significant chi-square? (I have a very large sample: more than 100,000 rows.)

Thanks!!
 

noetsi

Fortran must die
#2
H&L is popular because there are not a lot of alternatives. It is also conservative, so it is more robust to large sample sizes than other tests. Still, a hundred thousand cases will generate tremendous statistical power and thus might lead the test to be significant when it should not be. The problem is that it is hard to show that the test is wrong because there are no easy alternatives to use (and those that exist, such as the Pearson chi-square test, will probably be even more influenced by sample size).

One thing that occurs to me, though I don't know if it is statistically valid, would be to run the same test on a random subsample of your data, say a thousand cases, and see if the H&L test is still significant. If it is, then sample size is not the issue (although some, like Paul Allison, are critical of the test generally).
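For anyone who wants to try the subsample check, here is a rough sketch in Python using only the standard library. Everything in it is an assumption for illustration: the data are simulated (a model that is linear in x on the logit scale, with a small curvature the model misses), `hosmer_lemeshow` and `chi2_sf` are hand-written helpers rather than any package's API, and the predicted probabilities are taken as given instead of being refit on the subsample, so the usual groups - 2 degrees of freedom are only approximate here.

```python
import math
import random


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square with integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from Q(1, h) for even df or Q(1/2, h) for odd df, then step up
    # with the recurrence Q(a + 1, h) = Q(a, h) + h^a e^-h / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def hosmer_lemeshow(y, p, groups=10):
    """H&L statistic and p-value from 0/1 outcomes y and predicted probabilities p."""
    pairs = sorted(zip(p, y))  # order cases by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        expected = sum(pi for pi, _ in chunk)
        observed = sum(yi for _, yi in chunk)
        # (O - E)^2 / (N * pbar * (1 - pbar)), with pbar = E / N
        stat += (observed - expected) ** 2 / (expected * (1 - expected / len(chunk)))
    return stat, chi2_sf(stat, groups - 2)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
# Synthetic stand-in for the real data: the "model" is linear in x on the
# logit scale, while the truth has a small extra curvature term.
xs = [rng.uniform(-2, 2) for _ in range(100_000)]
p_model = [sigmoid(x) for x in xs]
y = [1 if rng.random() < sigmoid(x + 0.08 * x * x) else 0 for x in xs]

stat_full, p_full = hosmer_lemeshow(y, p_model)

# Same test on a random subsample of 1,000 cases
idx = rng.sample(range(len(y)), 1000)
stat_sub, p_sub = hosmer_lemeshow([y[i] for i in idx], [p_model[i] for i in idx])

print(f"full sample: chi2 = {stat_full:.1f}, p = {p_full:.3g}")
print(f"subsample:   chi2 = {stat_sub:.1f}, p = {p_sub:.3g}")
```

A single subsample is itself noisy, so repeating the draw a number of times and looking at how often the test rejects is probably more informative than one run.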
 

hlsmith

Omega Contributor
#3
I agree with noetsi's idea of running the test on a randomly drawn subsample. However, you have not provided much information on your model, which could also be a source of the problem. How many variables are there, and what are the frequencies for your dependent variable (how many per group)? What strategies did you use to build the model?
 

Dason

Ambassador to the humans
#4
The problem is that it is hard to show that the test is wrong
Why would you assume that the test is wrong? My guess is that there is lack of fit. The issue isn't whether the test is wrong or not - it's whether we care and how bad the fit is. With that much data you'll easily reject the null for even small deviations from the proposed model.

So why are you fitting the model in the first place? Is it for prediction? Are you trying to figure out what variables are important? What is the end goal?
 
#5
Hi,

You are right, I didn't give much info on the model. I have a logit model with a DV (0 for going to the ED in a severe condition; 1 for going to the ED in a non-severe condition), several demographic variables (sex, age, distance, health insurance, financial status, etc.), and one key predictor variable, Year (1 for 2012, 0 for 2011), since I want to estimate the impact of a political change (an increase in payment). I built this model on the assumption that only people with a non-severe condition would be affected by the change (year).
As Dason mentioned, I'm not very worried about the significance, because I don't actually want to predict whether a person goes to the ED with a severe or non-severe condition (this depends on a multitude of factors: genetic, behavioural, etc.). It's more like a "case-control design" (I don't think I can say this). I am just worried that, because the H&L test is significant, I cannot use the model.
Thanks a lot guys!!!
 

Dason

Ambassador to the humans
#7
One thing you might want to examine is whether the other covariates in your model could be modified in some way (maybe by adding squared terms as well) to produce a better fit. Have you looked into something like that?
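One way to check whether a squared term helps, sketched in plain Python: fit the model with and without the term and compare the two with a likelihood-ratio test. The data, the coefficient values, and the helpers `solve` and `fit_logit` (a bare-bones Newton-Raphson fit) are all made up for illustration; in practice you would use your statistics package's logistic regression routine.

```python
import math
import random


def solve(a, b):
    """Solve a small linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logit(X, y, iters=15):
    """Newton-Raphson logistic fit; returns (coefficients, log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            z = sum(bj * xj for bj, xj in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * row[i] * row[j]
        # Newton step: beta + (X'WX)^-1 X'(y - p)
        beta = [bj + sj for bj, sj in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for row, yi in zip(X, y):
        z = sum(bj * xj for bj, xj in zip(beta, row))
        loglik += yi * z - math.log1p(math.exp(z))
    return beta, loglik


rng = random.Random(2)
# Simulated covariate and outcome; the true model has a squared term (0.3 * x^2).
xs = [rng.uniform(-2, 2) for _ in range(5000)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(-0.5 + x + 0.3 * x * x))) else 0 for x in xs]

beta_lin, ll_lin = fit_logit([[1.0, x] for x in xs], y)
beta_quad, ll_quad = fit_logit([[1.0, x, x * x] for x in xs], y)

# Likelihood-ratio test for the one extra term (chi-square with 1 df)
lr = 2.0 * (ll_quad - ll_lin)
p_lr = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
print(f"squared-term coefficient = {beta_quad[2]:.3f}, LR = {lr:.1f}, p = {p_lr:.3g}")
```

If the squared term is clearly needed, refitting with it and rerunning the goodness-of-fit check shows how much of the misfit it accounts for.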
 

noetsi

Fortran must die
#9
Why would you assume that the test is wrong? My guess is that there is lack of fit. The issue isn't whether the test is wrong or not - it's whether we care and how bad the fit is. With that much data you'll easily reject the null for even small deviations from the proposed model.

So why are you fitting the model in the first place? Is it for prediction? Are you trying to figure out what variables are important? What is the end goal?
I am too loose in my wording, Dason. By "wrong" I mean that the model did not fit the data, which is "wrong" to me substantively :p

I did not realize that whether you were doing primarily prediction or relative importance mattered in analyzing the results of H and L. Nothing I read on that, and I have read a lot recently as I prepared a primer for myself on Logistic regression in the context of SAS, suggested this was the case.

It's a major reason I come here. I find things here none of the literature I spend so much time on mentions.

Several comments from the above statements. First, I was not actually sure my suggestion on taking a subsample was valid. It appears it might be which is encouraging :p

Second, I had not seen that H&L can tell you how badly the model fits the data. Is there a way to tell how badly the model fits the data through H&L, as with, for example, the fit indices in structural equation models, which show to some extent how poorly the model fits the data (admittedly through rules of thumb from experts, which are not true in every case)? I thought H&L was simply telling you whether the model does or does not fit the data, not the degree to which it does.

Third, Allison suggests that how many groups you use in the H&L test (the default is 10) can radically change the result of the H&L test. He suggests changing the number of groups (which Stata at least allows) and seeing if that matters. He also suggests that interaction terms are particularly problematic with H&L results (although he does not say why; it appears to be an empirical observation rather than a theoretical one on his part).

He suggests adding interaction and non-linear terms and seeing if this improves the model (and thus the H&L results). His take on the H&L test in this regard seems different than other authors. Thus he states...."Remember that what....HL test are evaluating is whether you can improve the model by including interactions and non-linearities."

I think it is testing whether there are problems generally with all the variables and their forms (including interactions), which may be what he really means.
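To see how much the number of groups matters, the statistic can be recomputed for several group counts on the same predictions. A minimal sketch, assuming simulated data and a hand-rolled `hl_test` helper (not Stata's or SAS's implementation); only even group counts are used so that the chi-square tail with groups - 2 degrees of freedom has a simple closed form.

```python
import math
import random


def chi2_sf_even(x, df):
    """Upper tail of a chi-square with even df: exp(-h) * sum_{i < df/2} h^i / i!."""
    h = x / 2.0
    return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(df // 2))


def hl_test(y, p, groups):
    """H&L chi-square and p-value for an even number of equal-sized risk groups."""
    pairs = sorted(zip(p, y))  # order cases by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        expected = sum(pi for pi, _ in chunk)
        observed = sum(yi for _, yi in chunk)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / len(chunk)))
    return stat, chi2_sf_even(stat, groups - 2)


rng = random.Random(1)
# Simulated example: predictions ignore a modest curvature in the true logit.
xs = [rng.uniform(-2, 2) for _ in range(3000)]
p_model = [1 / (1 + math.exp(-x)) for x in xs]
y = [1 if rng.random() < 1 / (1 + math.exp(-(x + 0.15 * x * x))) else 0 for x in xs]

# Same data, same predictions, different grouping
results = {g: hl_test(y, p_model, g) for g in (6, 10, 20)}
for g, (stat, pval) in results.items():
    print(f"groups = {g:2d}: chi2 = {stat:6.2f}, df = {g - 2:2d}, p = {pval:.3g}")
```

If the conclusion flips as the group count changes, that is a sign the H&L result on its own should not decide whether the model is usable.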
 
#10
I haven't squared any of my variables... I will look into that also.
I've used the backward (Wald) method, and the model just removes some variables, but the R2 is almost the same.
 

noetsi

Fortran must die
#11
I am not a big fan of backward or forward regression as it is commonly attacked in the literature for reasons that make sense to me. It relies too much on chance results that may be different from sample to sample.

Do you have any terms that might reasonably be interaction terms? For instance could gender and age interact?
 

hlsmith

Omega Contributor
#12
Back to my question: I cannot remember if overdispersion will throw things off, but do you have any variables that are almost exclusively related to one of the dependent variable groups? And what is the ratio of the two groups for the dependent variable?

It might also be of benefit if you posted the output for your model. In actuality we still know so little about the content and what you are doing that it's like having a mechanic on the phone trying to tell you how to fix your car, where what you are looking at and what they are visualizing may be two completely different things.
 

hlsmith

Omega Contributor
#13
Not sure if it already exists in abundance, but if a large sample size results in significant H-L test, I would think this warning would be plastered all over the place.
 

noetsi

Fortran must die
#14
I am not sure what you mean by "all over the place." When I was looking at comments on H&L recently (and in past readings on it) the issue of power was not raised except by Allison.

It is certainly not raised in the statistical software of course.
 
#15
The fact that you are doing variable selection and considering adding non-linear terms is a good hint that you should expect your fit to fail with enough data, because you don't really have a model. When you write down a model and do a fit you are not merely asserting that your model is a good guess at the basic shape of the data and asking your goodness-of-fit test if that's about right. You are asserting that your model (plus noise that follows exactly the distribution you assumed) is the complete and exact description of how the world works, and you are daring your goodness-of-fit test to prove you wrong with even the tiniest deviation it can find. It sounds like you actually just want to know whether the model is a decent first guess, not whether it is the end-all-and-be-all description of reality. In that case, you are better off just showing a graph of your data that illustrates visually that the fit is pretty good, rather than doing a goodness-of-fit test.
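The numbers behind such a graph are just a calibration table: sort cases by predicted probability, cut them into groups, and compare the mean prediction with the observed event rate in each group. A hedged sketch on simulated data, where `calibration_table` is a hypothetical helper name and the mild curvature in the "truth" is an assumption chosen to mimic a small misfit:

```python
import math
import random


def calibration_table(y, p, groups=10):
    """Rows of (group size, mean predicted probability, observed event rate)."""
    pairs = sorted(zip(p, y))  # order cases by predicted risk
    n = len(pairs)
    rows = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        mean_pred = sum(pi for pi, _ in chunk) / len(chunk)
        obs_rate = sum(yi for _, yi in chunk) / len(chunk)
        rows.append((len(chunk), mean_pred, obs_rate))
    return rows


rng = random.Random(3)
# Simulated stand-in for real data: predictions miss a small curvature term.
xs = [rng.uniform(-2, 2) for _ in range(20_000)]
p_model = [1 / (1 + math.exp(-x)) for x in xs]
y = [1 if rng.random() < 1 / (1 + math.exp(-(x + 0.08 * x * x))) else 0 for x in xs]

rows = calibration_table(y, p_model)
print(" size  predicted  observed     gap")
for size, pred, obs in rows:
    print(f"{size:5d}  {pred:9.3f}  {obs:8.3f}  {obs - pred:+.3f}")

# Largest absolute difference between observed rate and mean prediction
max_gap = max(abs(obs - pred) for _, pred, obs in rows)
print(f"largest gap: {max_gap:.3f}")
```

Plotting observed rate against mean prediction gives the usual calibration plot; the size of the gaps, not a p-value, then tells you whether the misfit matters for your purpose.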
 

hlsmith

Omega Contributor
#20
Thanks for the info ichbin.

Don't worry, Dason is not trying to be a super automated pain. It's just trying to make this forum the cleanest, most accurate resource for community-generated statistical conversations.