Which potential predictors to add to the logistic regression model?

#1
I am assessing the effects of some covariates and categorical variables (exactly 5 independent variables) on a binary outcome. When I enter the variables, one of them is significant. When I also enter their interactions (all possible pairwise interactions between the independent variables), some other variables (and some interactions) become significant and the previously significant one becomes non-significant!

I have seen previous studies that first ran bivariate chi-square tests and then entered into the model only those variables with chi-square P values below 0.1 or 0.15.

I don't know whether this is a valid method. Besides, I don't know what to do when it comes to "deciding on adding the interactions".

I would appreciate any suggestions and advice.
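
For concreteness, this is roughly what I am fitting (a minimal sketch in Python with statsmodels; the data file and the names y and x1..x5 are placeholders, not my real data):

```python
# A minimal sketch of the two models; "mydata.csv", y, and x1..x5 are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv")  # binary outcome y, five predictors x1..x5

# Main effects only
m_main = smf.logit("y ~ x1 + x2 + x3 + x4 + x5", data=df).fit()
print(m_main.summary())

# Main effects plus all pairwise interactions:
# (x1 + ... + x5)**2 expands to all main effects and all two-way interactions
m_int = smf.logit("y ~ (x1 + x2 + x3 + x4 + x5)**2", data=df).fit()
print(m_int.summary())
```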
 

trinker

ggplot2orBust
#2
Are the interactions meaningful? My opinion is that if they're not part of your research questions and they themselves aren't significant, then leave them out. But there are more qualified people than I who will likely chime in here. I automatically defer to them. :)
 
#3
Some of them are significant. Most of the pairwise interactions are important; they have been pointed out in some previous research, so it is good to test them as well. But a question on my mind is: if including them can change the results, how can the researcher (or me in particular!) be sure that the significant results output by one model are correct? Another funny thing is that exactly the same model with the same interactions (perhaps only with a slightly different order) gives different results in different packages!
I automatically appreciate your generous help! :)
 

ledzep

Point Mass at Zero
#4
Maybe not entirely related to this question, but here is a general strategy for performing model building (in response to your question in the chat box...).

1. Always keep the known confounders in the model regardless of their significance. For example, age, which is a surrogate for immunity, affects the disease outcome. [This may not be true for all diseases; it depends on the scientific knowledge. You should be prepared to set aside your statistical purity and respect the science where there is a sound reason. Hence, always keep the known confounders in the model, significant or not.]

2. Use all those univariate risk factors significant at the 10% or 20% level for model building. [Although this is valid, I prefer the likelihood ratio test.]

3. Use a likelihood ratio test (LRT) to determine which variables to include in the model [don't use the Wald p-values of the model coefficients!]. An LRT tests how much the model is improved by adding the variable (see the sketch after this list).

4. If an LRT is not significant, i.e. if the variable doesn't significantly improve the model fit, check the effect of dropping the variable on the remaining model coefficients.

5. For the final multivariable model, perform residual analysis and check the robustness of the model fit. Check the effect of adding back each of the variables you had left out to see whether it alters the results. If possible, check for interactions between the variables in your final model.
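
Here is a minimal sketch of the LRT in step 3 (Python with statsmodels; the data file and variable names are hypothetical placeholders):

```python
# Likelihood ratio test for adding one candidate variable (here x3);
# the data file and variable names are placeholders.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("mydata.csv")

reduced = smf.logit("y ~ x1 + x2", data=df).fit(disp=0)
full = smf.logit("y ~ x1 + x2 + x3", data=df).fit(disp=0)

# LRT statistic: 2 * (logL_full - logL_reduced), compared to a chi-square
# distribution with df equal to the number of extra parameters
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"LRT statistic = {lr_stat:.3f}, df = {df_diff:.0f}, p = {p_value:.4f}")
```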


Once you've come up with a final model, perform cross-validation, robustness testing, sensitivity analysis...
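
For example, a rough cross-validation check could look like this (scikit-learn, again with placeholder data and predictors; it is just one possible sanity check, not the whole validation):

```python
# 5-fold cross-validated AUC as a simple check of out-of-sample performance;
# the data file and predictor names are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("mydata.csv")
X = df[["x1", "x2", "x3"]]   # the predictors kept in the final model
y = df["y"]

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", scores.mean())
```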

The main thing to check with your model: does your model make sense? Are the results realistic?

I have seen previous studies that first ran bivariate chi-square tests and then entered into the model only those variables with chi-square P values below 0.1 or 0.15.
It is an acceptable and widely used method to include only those univariate risk factors significant at the 10% or 20% level. However, the likelihood ratio test is preferred, as said before.
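
For illustration, the univariate chi-square screening could look something like this (a sketch with pandas and scipy; the variable names and threshold are placeholders):

```python
# Univariate chi-square screening of categorical predictors against the
# binary outcome; keep those with p below the screening threshold.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("mydata.csv")

candidates = []
for var in ["x1", "x2", "x3", "x4", "x5"]:
    table = pd.crosstab(df[var], df["y"])
    chi2, p, dof, expected = chi2_contingency(table)
    if p < 0.20:            # 10% or 20% screening level
        candidates.append(var)
print("Variables passing the screen:", candidates)
```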

A likelihood ratio is the answer to almost any question :) However, you have to bring scientific knowledge into the model-building process.
 
#5
Ledzep, I highly appreciate your super comprehensive comment. I will use it as a future reference too. :)

It seems that modeling requires a considerable degree of subjective judgment, then:

The main thing to check with your model: does your model make sense? Are the results realistic?
However, you have to bring scientific knowledge into the model-building process.
I hope the model I choose as realistic and reasonable is not just the best-looking one!

Different models have converged to one very nice model with most of the variables being significant!! I am so tempted to see it as the most realistic one now :D But I seriously am tempted! Besides, that model doesn't seem bad or unrealistic.

The point is that, according to the Spearman correlation, there are no significant correlations between any of the independent variables and the outcome. So it is very interesting that a combination of ALL the same 5 predictors plus only 2 pairwise interactions leads to a model in which all the predictors and interactions (except one predictor) are significant! It is nice to see that when I adjust for the other predictors in a multivariable model, much more significance appears than when correlating each of the IVs with the outcome on its own. Here I think I should rely more on that likelihood ratio test instead of my evil temptations! :)

edit: The good news is that the above-mentioned successful model also has one of the smallest log-likelihoods. It is not the smallest one, but the second smallest. So I can approach it with more confidence, especially if I test and see that there is no significant difference between the two smallest LLs.
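
Something like the following could be used to test the difference between the two candidate models (a sketch with statsmodels; the formulas are placeholders based on the description above, and the likelihood ratio test is only valid if one model is nested in the other):

```python
# Compare two candidate logistic models by likelihood ratio test;
# formulas and data are hypothetical placeholders, and the test assumes
# the smaller model is nested in the larger one.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("mydata.csv")

m_small = smf.logit("y ~ x1 + x2 + x3 + x4 + x5", data=df).fit(disp=0)
m_big = smf.logit("y ~ x1 + x2 + x3 + x4 + x5 + x1:x2 + x3:x4",
                  data=df).fit(disp=0)

lr_stat = 2 * (m_big.llf - m_small.llf)
df_diff = m_big.df_model - m_small.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"LRT: stat = {lr_stat:.3f}, df = {df_diff:.0f}, p = {p_value:.4f}")
```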