Logistic Regression in SPSS

#1
Hey guys,

So I've been battling with Logistic Regression in SPSS for about a week now, and I'm getting a bit fed up :confused:.

To set the scene: I have a dependent variable "Have you ever drunk alcohol?" and a pool of about 300 kids, with responses to questions regarding "confidence", "like of school" and "participation in out of school activities". All of these variables are categorical.

My aim is to look at each variable separately as an exposure variable, and control for the other variables (after determining which variables are strongly associated with my dependent variable). For each exposure-dependent variable model I will obtain odds ratios with confidence intervals that will allow me to say things like "children who strongly disagree that 'school is a nice place to be' are twice as likely to have drunk alcohol as those who strongly agree".
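
To make the "twice as likely" statement concrete: for a single 2x2 exposure-outcome table, the crude odds ratio and its Woolf (log-scale) 95% confidence interval can be computed directly. The counts below are made up purely for illustration, not from the real data:

```python
import math

# Hypothetical counts (not real data) for one exposure-outcome pair:
#                              ever drunk   never drunk
# disagree "school is nice"        30            45
# agree "school is nice"           20            60
a, b = 30, 45  # exposed group (disagree)
c, d = 20, 60  # unexposed / reference group (agree)

odds_ratio = (a / b) / (c / d)

# Woolf 95% CI, computed on the log-odds scale
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(round(odds_ratio, 2))  # 2.0
print(round(ci_low, 2), round(ci_high, 2))
```

Note this is the crude, unadjusted odds ratio; adjusting for the other confounders is exactly what the logistic regression model is for.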

The difficulty comes in deciding which variables to put into the initial model, and then what to do with that model once they are in.

Thus far I have determined which variables are associated with the dependent variable via bivariate analysis and Cramer's V. This gave me 9 variables which all have an association (Cramer's V ranging from 0.1 to 0.35) with "Have you ever drunk alcohol?". The next step is to determine whether any of these variables are associated with one another, because entering two strongly associated variables into the model can lead to instability etc... I constructed a 9x9 matrix of pairwise associations, which indicated that none of these 9 variables are closely associated (all values below 0.4).
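
As a sanity check on the SPSS output, Cramer's V can be computed by hand from a contingency table: take the chi-square statistic and scale it by the sample size and table dimensions. A small sketch with made-up counts:

```python
import math

# Hypothetical 2x2 table (made-up counts): item response vs. "ever drunk alcohol"
table = [[30, 45],
         [20, 60]]
n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Pearson chi-square: sum of (observed - expected)^2 / expected
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(len(table)) for j in range(len(table[0])))

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
k = min(len(table), len(table[0]))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))
print(round(cramers_v, 3))
```

For a 2x2 table this reduces to the phi coefficient; the same formula covers the larger tables you get from 4-level Likert items.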

Thus all 9 of these variables should be controlled for when modelling "Have you ever drunk alcohol?". [This is correct, is it not?] On top of this, Gender should be controlled for.

Interaction terms: what interactions and other confounders, if any, should be put into the model? I spent a good few hours taking all 9 of these variables and:
1. Running a logistic regression model for each pair of variables.
2. Running a logistic regression model for each pair of variables WITH the interaction term for those two variables.
3. Computing the likelihood-ratio statistic comparing these two models.
4. Using this test result to decide whether that particular interaction term should go into the initial model.
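
Steps 1-3 amount to a likelihood-ratio test: the difference in -2 log likelihood between the two models is compared against a chi-square critical value whose degrees of freedom equal the number of extra interaction parameters. A sketch with hypothetical log-likelihood values (SPSS reports -2LL in its Model Summary table):

```python
# Hypothetical log-likelihoods (SPSS prints -2LL; divide by -2 to get these)
ll_main = -180.4         # model with the two main effects only
ll_interaction = -178.1  # same model plus the interaction term

lr_stat = -2 * (ll_main - ll_interaction)

# Critical value for chi-square with 1 df at alpha = .05. With dummy-coded
# 4-level items the interaction adds more than one parameter, so the df
# (and hence the critical value) would be larger.
CHI2_CRIT_1DF = 3.841
keep_interaction = lr_stat > CHI2_CRIT_1DF
print(round(lr_stat, 2), keep_interaction)  # 4.6 True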

Trouble is, I now have about 20 initial variables... and surely the hunt for other confounders and interaction terms doesn't stop here. This isn't my main issue though; my main problem is knowing what to do once I have my initial model.

Suppose I wanted to consider the exposure variable "Your teachers treat you fairly" - Strongly Agree, Agree, Disagree, Strongly Disagree (possibly collapsing it to an Agree and Disagree variable). To determine odds ratios with confidence intervals we want to run a logistic regression model with "Have you ever drunk alcohol?" as the dependent variable, "Your teachers treat you fairly" as the exposure variable, and controlling for all the other confounders: "Do your parents drink?", "Gender", "School rules are too strict", etc...

I have the 9 variables that are associated with the dependent variable, one of which is the exposure variable of interest. But what do I do in SPSS to run the best model possible, so as to end up with the most appropriate odds ratios and confidence intervals for my exposure variable "Your teachers treat you fairly"??? :confused: ???
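
Whichever model-building method is used, the adjusted odds ratio for the exposure comes straight from its coefficient in the final model: SPSS reports it as Exp(B), and adding /PRINT=CI(95) to the LOGISTIC REGRESSION syntax gives the confidence interval alongside it. The arithmetic behind those columns, with hypothetical B and S.E. values for the "Disagree" dummy variable:

```python
import math

# Hypothetical coefficient and standard error for the dummy variable
# "teachers treat you fairly: Disagree" in the adjusted model
B, SE = 0.69, 0.28

or_adj = math.exp(B)  # what SPSS labels Exp(B)
ci_low = math.exp(B - 1.96 * SE)
ci_high = math.exp(B + 1.96 * SE)
print(round(or_adj, 2), round(ci_low, 2), round(ci_high, 2))
```

The interval is computed on the log-odds scale and then exponentiated, which is why it is asymmetric around the odds ratio.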

I understand the theory behind the backward and forward LR and Wald stepwise methods of building a model (I have been reading Kleinbaum's "Logistic Regression" - excellent :tup:), I just don't know which one to use (and why) in order to achieve what I want (namely the above paragraph).

The professor of the department said:

"Run a regression model on the main affect variables (i.e. the 9 variables I have been discussing), and add interaction terms only if they increase the percentaged explained (in the SPSS output)".

But what about improving the odds ratios, and making sure the other variables are properly controlled for (i.e. ensuring that the odds ratios and confidence intervals for the exposure variable of interest are correct)?

Thanks very much.

L-dawg
 
#2
L-dawg said:

Hey guys,

So I've been battling with Logistic Regression in SPSS for about a week now, and I'm getting a bit fed up :confused:.
First, thanks for the detailed question. Makes it easier to answer.

And I just have to say, spending a week on model building isn't uncommon. It doesn't sound like a battle--just part of the process. :)

L-dawg said:

My aim is to look at each variable separately as an exposure variable, and control for the other variables (after determining which variables are strongly associated with my dependent variable). For each exposure-dependent variable model I will obtain odds ratios with confidence intervals that will allow me to say things like "children who strongly disagree that 'school is a nice place to be' are twice as likely to have drunk alcohol as those who strongly agree".

The difficulty comes in deciding which variables to put into the initial model, and then what to do with that model once they are in.

Thus far I have determined which variables are associated with the dependent variable via bivariate analysis and Cramer's V. This gave me 9 variables which all have an association (Cramer's V ranging from 0.1 to 0.35) with "Have you ever drunk alcohol?". The next step is to determine whether any of these variables are associated with one another, because entering two strongly associated variables into the model can lead to instability etc... I constructed a 9x9 matrix of pairwise associations, which indicated that none of these 9 variables are closely associated (all values below 0.4).

Thus all 9 of these variables should be controlled for when modelling "Have you ever drunk alcohol?". [This is correct, is it not?] On top of this, Gender should be controlled for.
My question: what is the point of this model? Is it purely prediction or are you interested in describing relationships? For example, if you're creating a model so principals can best predict which of their students is most likely to drink alcohol, you don't really care what it all means. But if you are a social science researcher, the relationships are more important than the actual prediction.

The reason I ask is because if all you care about is prediction, then go ahead and just see what works best. If, however, the point is relationships, you should be using some theoretical knowledge about the variables to decide which predictors to include. That goes for interactions as well. With that many categorical predictors (I assume you're dummy coding?), this could get difficult to interpret really fast if you include many interactions. It won't be helpful to have a "correct" model if you can't interpret it.
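
On the dummy-coding point: a 4-level Likert item becomes three indicator variables measured against a reference category (what SPSS's Categorical... dialog calls an Indicator contrast). A minimal sketch of that coding:

```python
# Dummy coding a 4-level item with "Strongly Agree" as the reference category
levels = ["Strongly Agree", "Agree", "Disagree", "Strongly Disagree"]
reference = "Strongly Agree"

def dummy_code(response):
    # One 0/1 indicator per non-reference level; the reference
    # category is all zeros.
    return [int(response == lev) for lev in levels if lev != reference]

print(dummy_code("Disagree"))       # [0, 1, 0]
print(dummy_code("Strongly Agree")) # [0, 0, 0]
```

So a single two-way interaction between two such items already adds 3 x 3 = 9 parameters, which is why interpretation gets hard so quickly.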

You seem very concerned about having the "correct" model and the "correct" odds ratio. Remember that in any statistical modeling, we never know if the model is correct. Just that it fits the data reasonably well. All the odds ratios are "correct," no matter what other control variables are in the model, but they mean different things. This is where stats and math diverge. There isn't one right answer. It's very hard to get used to.

Remember that simpler models are always considered "better." So if a model without any interactions gives you a fit equal to one with many interactions, go for the simpler model.

Good luck,
Karen