Hello everybody,
I have a logistic regression model where the outcome Yes is coded as 1 and No as 0. The split between Yes and No is about 30/70 (about 135 yesses, 340-odd noes, in a data set of 475). I have 8 predictors at present, all categorical. The SPSS output comes through fine: some predictors are significant, some not, and a Nagelkerke R-squared of about 0.20, which is OK in my field. The model with the predictors in predicts about 75% of outcomes correctly, but bear in mind 70% are Noes anyway.
What I do find, however, is that the model predicts over 90% of Noes correctly but only 25% of Yesses. The reason is that out of the 475 cases, about 400 were predicted to be No, so since most of them really are Noes, I get a good % correct for that category.
I am wondering: is that just 'the outcome', or does it suggest some fundamental flaw somewhere? It may be that the predictors are all predictors of Noes, or that the predictors of Noes are stronger than those of Yesses? Or is there some other reason that could cause such a high negative prediction rate?
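[Editor's note: the pattern Pete describes can be illustrated with a quick sketch. The confusion-matrix counts below are invented to mirror his description (475 cases, ~340 No / ~135 Yes, ~400 predicted No); they are not his actual SPSS output.]

```python
# Hypothetical counts, assumed for illustration only (not Pete's data):
tn, fp = 312, 28    # actual No:  predicted No / predicted Yes
fn, tp = 101, 34    # actual Yes: predicted No / predicted Yes

total = tp + tn + fp + fn                # 475 cases
accuracy    = (tp + tn) / total          # overall % correct (~0.73)
specificity = tn / (tn + fp)             # % of Noes correct (~0.92)
sensitivity = tp / (tp + fn)             # % of Yesses correct (~0.25)
baseline    = (tn + fp) / total          # always guessing "No" (~0.72)
```

With these assumed counts, the overall accuracy barely beats the "always predict No" baseline, even though specificity looks impressive: exactly the situation described.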
Many thanks for reading
Pete
Hello Pete!
Please allow me to double-check the following: "8 predictors at present, all categorical" -- how many categories are there per predictor? Have you dichotomized them?
Good description of your model! Not everyone does that. First big question: all 8 predictors are categorical, so I am assuming binary??
What does your model look like when it is not saturated? Does it do a better job when some variables are not in it? From your description, it seems like you may still have poor predictors in the model; if so, how can you complain about the accuracy? Chop it down, play around with a simpler model, and see if you still have the same issues. It could be that poor predictors are just botching things up.
Not everyone loves the Hosmer-Lemeshow test for model fit, but what does that look like? It examines the predicted versus observed frequencies.
After reading hlsmith's comment, I additionally wanted to ask: what are (a) the pseudo R-squared, and (b) the log likelihood?
The former is admittedly not all that representative of model fit, but it gives a first impression.
kiton,
I believe the "Nagelkerke of about 0.20" is a pseudo R-sq (Nagelkerke's adjustment of the Cox & Snell).
OP,
Also, to get you thinking about variable inclusion: if models are nested (simpler versions), you can use the -2 log likelihood values to examine whether a variable should be included in the model.
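[Editor's note: the nested-model comparison hlsmith describes can be sketched as a likelihood-ratio test on -2LL values. Both numbers below are invented for illustration (the 425.6 merely echoes the figure Pete reports later); the p-value formula is only valid for even degrees of freedom.]

```python
import math

# Invented -2 log-likelihood values, not real output
neg2ll_full = 425.6      # assumed: model with all 8 predictors
neg2ll_reduced = 440.2   # assumed: model with 4 predictors dropped

lr_stat = neg2ll_reduced - neg2ll_full   # the larger model always fits at least as well
df = 4                                   # number of parameters dropped

# Chi-square survival function, closed form for even df
x = lr_stat / 2
p_value = math.exp(-x) * sum(x**i / math.factorial(i) for i in range(df // 2))
# p_value comes out around 0.006 here, so with these invented numbers the
# four dropped predictors would jointly matter and should stay in the model.
```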
Hello everybody,
Wow thank you all for your replies!
To answer some of the questions: yes, all the variables are binary, things like gender, whether the participant had particular experience (Yes/No), etc.
The Hosmer and Lemeshow test comes up as p = 0.70, so not significant, meaning a decent fit?
Is 8 quite saturated as a model? I was thinking it was reasonable in terms of events per variable (EPV)? I'll have a play around with that, then.
The -2 log likelihood of the overall model is 425.554, not sure whether that's good or bad? The model has a chi-square of 64.55 and a sig. of 0.000 compared to the null model.
Should probably say that in my world (international relations) we aren't looking for things like causation or parsimony of models, just testing for associations between outcomes and variables. The thing that gets me is that some of the variables are positive and significant, meaning they are associated with producing a Yes? And yet my model doesn't actually predict very many Yesses? And because of the high number of Noes predicted, I can't really trust the 90% success rate for Noes, as it's largely a fluke. My hunch is that the variables have only a moderate association with the outcomes and that's the problem, rather than a problem with the running of the model?
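[Editor's note: one concrete thing worth checking is the classification cutoff. SPSS's classification table uses 0.5 by default; with a ~30% base rate of Yesses, few predicted probabilities may ever reach 0.5, which by itself produces exactly this "almost everything predicted No" pattern. A sketch with invented probabilities, not Pete's data:]

```python
# Assumed predicted probabilities and outcomes, invented for illustration
p_hat = [0.15, 0.22, 0.28, 0.33, 0.41, 0.47, 0.55, 0.62]
y     = [0,    0,    1,    0,    1,    1,    0,    1]

def sensitivity(cutoff):
    """Fraction of actual Yesses classified as Yes at a given cutoff."""
    preds = [int(p >= cutoff) for p in p_hat]
    tp = sum(1 for p_, y_ in zip(preds, y) if p_ == 1 and y_ == 1)
    return tp / sum(y)

# At cutoff 0.5, only 0.55 and 0.62 predict Yes: 1 of 4 Yesses caught (0.25)
# At cutoff 0.3 (near the base rate), 5 cases predict Yes: 3 of 4 caught (0.75)
```

Lowering the cutoff trades specificity for sensitivity; it doesn't change the model itself, only how its probabilities are turned into a classification table.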
Pete