Logistic Regression: Testing for interaction effects with many covariates

#1
Hello everyone! As the topic states, I am running a logistic regression with quite a lot of independent variables (23). There are many more candidate variables, but this is the best model I ended up with. I would like to know whether there are any possible interaction/synergy effects among them, but there are too many combinations to test each possible interaction term in a separate model. Is there a test I can use to screen for possible interaction terms among my IVs? Btw, I am using my college's Stata 11.

I am fairly new to regression and have been trying to google and read up on this, but all I get are pointers to ANOVA, which I doubt can be used for binary dependent variables (don't they violate the assumptions of ANOVA?). This is also my first time joining a statistics forum. Thank you all in advance.
 

jpkelley

TS Contributor
#2
Good question. First, I was struck by the large number of variables. You must have a pretty large data set to be able to handle that. May I ask how many total rows of data you have, just so everyone responding can have an idea of the scale?

You obviously have the right amount of caution about wanting to avoid over-saturating your model. Unfortunately, for interactions, fitting them in the regression itself is exactly how you test whether they are present. Have you considered one of the many model selection procedures out there? In your case, this might be a way to include as many variables and interactions in each model as you can, then run what I imagine would be a very large number of models with all possible combinations. You'd then be forced into a model-averaging framework. Not sure if you want to go that route, as it can get hairy.
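To make the first point concrete, here is a minimal Stata sketch of testing a single candidate interaction with a likelihood-ratio (or Wald) test; the variable names below are just placeholders for whatever is in your model:

* Hypothetical variable names; graded_a is the 0/1 outcome.
logit graded_a iq arith grammar
estimates store base

* ## adds both main effects and their interaction (factor-variable syntax, Stata 11+)
logit graded_a c.iq##c.arith grammar
estimates store inter

* Likelihood-ratio test of the interaction term
lrtest base inter

* Or a Wald test of just the interaction coefficient
testparm c.iq#c.arith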
 
#3
Thanks jpkelley for replying so quickly.

I only have 400 rows of data.

Your response made me wonder if perhaps I'm not asking the right questions.

"How many variables should I include?" is a question I cannot answer clearly. To build my model, I consulted an expert on what variables would be of interest, took that list and did backwards stepwise regression. I know stepwise is now frowned upon but I could not find out how to do lasso or ridge regression for logit in Stata.

Let me illustrate the background of my research:

I am interested in what affects the probability of an employee being graded A by a company (their internal rating system). I ended up with variables like IQ, highest educational attainment, personality test scores, arithmetic scores, grammar scores, seminar activity, etc. So actually, the bulk of those variables are categorical; if you are familiar with the 16PF test, I used each category score as a variable.

Again, forgive me; I only have an undergraduate understanding of regression. As I understand it, marginal effects are computed at the point defined by the means of all the covariates. I therefore assumed that including more significant variables would reduce the residual. Including as many significant variables as I could seemed to support this, as it improved the fit and the p-values of my model.
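For what it's worth, this is how I've been getting those marginal effects at the means (again, hypothetical variable names):

* Marginal effects of each covariate, evaluated at the covariate means
* (hypothetical variable names)
logit graded_a iq arith grammar
margins, dydx(*) atmeans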

Is this approach actually wrong? Should I instead subdivide my model into smaller models?
 

jrai

New Member
#4
This is a very subjective question. You should try to answer it with theory. If you think that a variable is theoretically important, then leave it in the model even if it is not significant. I had this debate with my econometrics professors during my master's program, and they were all in favor of keeping a variable in the model if it is theoretically supported. Of course, if a variable is significant then it will help you reduce the error variance/residual.

The aim should be to make the model general and not overfitted. An overfitted model will perform very well on the current sample, but its out-of-sample predictions won't be accurate. You can compare two models for out-of-sample prediction by comparing their AIC and BIC (choose the lower ones) or by better cross-validation results. Some researchers propose MDL as the best solution, but unfortunately most modern packages (SAS, SPSS; I don't know about Stata) don't calculate MDL. There is an excellent discussion of this in the Journal of Mathematical Psychology special issue on model selection (March 2000, vol. 44, no. 1). Or here is a very good paper on it: Toward a Method of Selecting Among Computational Models of Cognition.
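In Stata, AIC and BIC are available right after estimation with estat ic, so comparing two candidate models is quick; a sketch with placeholder variable names:

* Fit two candidate models and compare information criteria;
* lower AIC/BIC is preferred. Hypothetical variable names.
logit graded_a iq arith
estat ic

logit graded_a iq arith grammar seminar
estat ic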

I'd say compare a few models and see whether the variables in those models make good theoretical sense. As for testing interactions, try the ones that sound most theoretically relevant.

But one important thing is to look for segments in your population. If there are distinct segments, then it is a good idea to build separate models for the different segments. The variable selection might differ across segments. A variable might be insignificant for the overall population but highly significant for a particular segment.
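In Stata terms, that might look something like the following; the segment variable and the other names are just placeholders:

* Fit the model separately within each hypothetical segment
logit graded_a iq arith if segment == 1
logit graded_a iq arith if segment == 2

* Or interact a segment indicator with a predictor and test whether the effect differs
logit graded_a i.segment##c.iq arith
testparm i.segment#c.iq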