Question about interaction in regression

#1
This is my first time posting a question here.
If anyone has any clue about my problem, please help me.

I'm running a logistic regression of the outcome on SBP (systolic blood pressure), DBP (diastolic blood pressure), BMI, and other covariates.

Because SBP and BMI did not show a linear relationship with the logit of the outcome, I converted these continuous variables into categorical variables by quintile.

Below are the models I tried.
Model1 : Logistic outcome covariates BMI SBP
Model2 : Logistic outcome covariates BMI DBP
Model3 : Logistic outcome covariates BMI SBP DBP
Model4 : Logistic outcome covariates BMI SBP DBP SBP*DBP

These are models 1 and 2

This is model 3


In model 3, the significance of the 2nd and 3rd quintiles of SBP and the 3rd and 4th quintiles of DBP changed compared with models 1 and 2, so I wondered whether there is collinearity between SBP and DBP.

But all VIF values were below 3.

So I then wanted to check the interaction between SBP and DBP.

I made an interaction term from the categorical SBP and DBP, which resulted in a 16-level variable.

This is model 4

None of the levels of the interaction term were significant, but SBP and DBP lost their significance at most levels.
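One way to judge the interaction as a whole, rather than level by level, is a likelihood-ratio test between the models with and without the SBP by DBP term. The following is a minimal sketch in R on simulated data; the data frame `d`, the variable names, the sample size, and the effect sizes are all assumptions made for illustration and are not the poster's data.

```r
set.seed(1)
n <- 20000
sbp <- rnorm(n, 125, 15)
dbp <- 80 + 0.3 * (sbp - 125) + rnorm(n, 0, 9)   # DBP moderately correlated with SBP

# Simulated outcome with main effects only (no true interaction)
y <- rbinom(n, 1, plogis(-1 + 0.02 * (sbp - 125) + 0.01 * (dbp - 80)))

# Quintile categories, mirroring the categorisation described in the post
quint <- function(x) {
  cut(x, quantile(x, 0:5 / 5), include.lowest = TRUE, labels = paste0("Q", 1:5))
}
d <- data.frame(y = y, sbp_q = quint(sbp), dbp_q = quint(dbp))

m3 <- glm(y ~ sbp_q + dbp_q, data = d, family = binomial)   # like Model3
m4 <- glm(y ~ sbp_q * dbp_q, data = d, family = binomial)   # like Model4

# Extra parameters added by the interaction: (5 - 1) * (5 - 1) = 16
n_int <- length(coef(m4)) - length(coef(m3))
print(n_int)

# Joint likelihood-ratio test of all interaction terms at once
lrt <- anova(m3, m4, test = "Chisq")
print(lrt)
```

A single joint test avoids reading 16 separate p-values, and it is unaffected by the choice of reference level, which the individual coefficients are not.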

So I have some questions about my analysis.

(1) Do I have to include the interaction term in the model or not? Do you think there is any collinearity or interaction between SBP and DBP?

(2) How can I interpret the difference between models 1 and 2 versus model 3?

(3) Can small VIF values alone confirm the absence of multicollinearity? I ask because the condition index was 88.68. How can a low VIF and a high condition index be present at the same time?
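On question (3), one possible explanation (not necessarily the one that applies to this dataset) is that the two diagnostics look at different matrices. VIF regresses each predictor on the others with an intercept, so it ignores each predictor's mean. A condition index in the style of Belsley is often computed on the uncentered design matrix including the intercept column, so predictors whose mean is large relative to their spread look nearly collinear with the constant. The sketch below shows this on simulated data using base R only; all variable names and values are assumptions for illustration.

```r
set.seed(2)
n <- 2000
sbp <- rnorm(n, 125, 15)                        # mean far from zero relative to its SD
dbp <- 80 + 0.3 * (sbp - 125) + rnorm(n, 0, 9)
bmi <- rnorm(n, 24, 3)
X <- cbind(sbp = sbp, dbp = dbp, bmi = bmi)

# VIF_j = 1 / (1 - R^2_j), from regressing predictor j on the other predictors
vif <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})

# Condition index: scale every column (intercept included) to unit length,
# then take the ratio of the largest to the smallest singular value
cond_index <- function(M) {
  M <- sweep(M, 2, sqrt(colSums(M^2)), "/")
  s <- svd(M)$d
  max(s) / min(s)
}

ci_raw      <- cond_index(cbind(1, X))                        # uncentered predictors
ci_centered <- cond_index(cbind(1, scale(X, scale = FALSE)))  # mean-centered predictors

print(round(vif, 2))
print(round(c(raw = ci_raw, centered = ci_centered), 1))
```

If centering the continuous covariates makes the condition index collapse while the VIFs stay the same, the large value mostly reflects collinearity with the intercept rather than between SBP and DBP.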

Please help me....
Best
 

hlsmith

Not a robit
#2
How did you establish they weren't linear in the logit? How were the reference levels determined? Have you thought about using a generalized additive model? What is your dependent variable?
 
#4
To hlsmith

Thank you for your reply.

1) The outcome variable is prevalence of proteinuria.
It is a cross-sectional study: same age, male, 300,000 observations.

2) I checked linearity between the logit and each continuous variable by visualizing it with a generalized additive model.
Below is the graph of SBP, DBP, and BMI against proteinuria.

The relationship looks nonlinear for SBP and BMI, is that right?
I wanted to show a reverse-J-shaped relationship between SBP and proteinuria, so I categorized SBP into quintiles.
Since I changed SBP to a categorical variable, I also changed DBP for consistency,
and since BMI also showed a U shape, I changed it to a categorical variable as well.

3) Then I set the reference level of each categorical variable to the category with the lowest odds ratio, for ease of interpretation.

Here is a similar article.
https://www.sciencedirect.com/science/article/pii/S0917504017301211

gam.png
 
#5
Thank you for your reply.

In an article dealing with aneurysms, the authors recommend theoretical or clinical cut-points instead of data-driven cut-points.

If I choose the clinically used cut-off levels of 120 and 140 (<120, 120-140, >140), would there be any problems with the categorical variables?

Or do I have to use fractional polynomials or a cubic spline model for nonlinear relationships?
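One practical detail with fixed cut-points is that the ranges as written (<120, 120-140, >140) do not say where a reading of exactly 120 or 140 belongs. A small sketch in base R with made-up readings; the choice of left-closed intervals and the labels are assumptions for illustration, not a clinical recommendation.

```r
sbp <- c(95, 119, 120, 139, 140, 141, 180)   # made-up readings

# right = FALSE gives left-closed intervals: [-Inf, 120), [120, 140), [140, Inf)
sbp_cat <- cut(sbp, breaks = c(-Inf, 120, 140, Inf), right = FALSE,
               labels = c("<120", "120-139", ">=140"))

print(data.frame(sbp, sbp_cat))
print(table(sbp_cat))
```

Whichever convention is used, stating it explicitly keeps the categories reproducible.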
 

spunky

Doesn't actually exist
#6
Fractional polynomials or a cubic spline model sounds more like it. It's better practice to adopt a model that can handle your type of data than to force the data to comply with the model's assumptions. Once you start truncating, converting, removing outliers, etc., it raises the question of "researcher degrees of freedom" or "the garden of forking paths". Like in your case... why choose quintiles? Why not quartiles? Or tertiles?

A quote usually attributed to Ronald Coase comes to mind: "If you torture the data long enough, it will confess to anything"
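As a rough illustration of the spline route, the sketch below fits a natural cubic spline for SBP inside an ordinary logistic regression and reports odds ratios against a reference value, which keeps the interpretability of categories without cutting the variable. It uses `splines::ns` with `glm` rather than the gam package, and the simulated U-shaped data, the 4 degrees of freedom, and the 120 mmHg reference are all assumptions for illustration.

```r
library(splines)   # ships with base R

set.seed(3)
n <- 10000
sbp <- rnorm(n, 125, 15)
# Simulated U-shaped log-odds with its minimum at 120 mmHg (an assumption)
y <- rbinom(n, 1, plogis(-2.5 + 0.002 * (sbp - 120)^2))
d <- data.frame(y = y, sbp = sbp)

m_lin <- glm(y ~ sbp, data = d, family = binomial)               # linear in the logit
m_spl <- glm(y ~ ns(sbp, df = 4), data = d, family = binomial)   # natural cubic spline

# Does the spline fit better than a straight line in the logit?
print(anova(m_lin, m_spl, test = "Chisq"))

# Odds ratios at selected SBP values relative to 120 mmHg
grid <- data.frame(sbp = c(100, 110, 120, 130, 140, 160))
lp <- predict(m_spl, newdata = grid, type = "link")
or_vs_120 <- exp(lp - lp[grid$sbp == 120])
print(round(setNames(or_vs_120, grid$sbp), 2))
```

The reference value here is chosen in advance rather than picked as the point with the lowest estimated odds, which avoids one of the data-driven choices discussed in the thread.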
 

hlsmith

Not a robit
#7
Biological plausibility should guide decisions as well. SBP may be linear as well, since the tail is sparse with observations (though 300k is a big number). I am guessing you used the mgcv package in R for these, correct? I am intrigued that all four had 4 degrees of freedom. Did you select that or did the model generate it? I haven't used mgcv, could you post that code snippet? Also, when you fit these, did you use the saturated model (multiple logistic regression)?
 
#8
My assumption was that SBP has a nonlinear rather than a linear relationship with the outcome.
I chose quintiles just to make the results easy to interpret.
Quintiles show the change in odds ratio more flexibly than quartiles or tertiles.
I could have chosen a regression method that deals with nonlinear relationships, like fractional polynomials or piecewise regression.
Do you mean I should have chosen one of these analyses?
 
#9
Thank you for your reply.
I used the gam package in R, selecting df = 4 myself, in a multivariable model.

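library(gam)  # s(x, df = ...) is the gam package's smoothing-spline term
# Additive logistic model: each predictor gets a smooth with 4 degrees of freedom
logitgam1 <- gam(upro ~ s(bmi, df = 4) + s(sbp, df = 4) + s(dbp, df = 4) + s(gfr, df = 4),
                 data = bp, family = binomial)
summary(logitgam1)
plot(logitgam1, se = TRUE)  # fitted smooths with standard-error bands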

I also tried plotting logit(proteinuria) against SBP in Stata using the lowess command:
lowess upro sbp, logit
This is the univariate setting. The graph is below. The red lines mark the 1st to 99th percentile range.
I think SBP shows a nonlinear U shape.

11.png
 

hlsmith

Not a robit
#10
What I was alluding to is: what happens to the plots when you change the degrees of freedom? Typically you would fit them with splines or something similar instead of breaking them up. You are only supposed to break them up if there is a true underlying phenomenon occurring. If you are trying to publish this, you may do both, I guess.
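A quick way to see how much the fitted shape depends on the degrees of freedom is to refit the same model over a range of df values and compare the fits. The sketch below does this with `splines::ns` inside `glm` instead of the gam package, on simulated data; the data, the df range, and the evaluation grid are assumptions for illustration.

```r
library(splines)   # ships with base R

set.seed(4)
n <- 10000
sbp <- rnorm(n, 125, 15)
y <- rbinom(n, 1, plogis(-2.5 + 0.002 * (sbp - 120)^2))  # simulated U shape
d <- data.frame(y = y, sbp = sbp)

# Refit the same logistic model with 2 to 6 spline degrees of freedom
dfs <- 2:6
fits <- lapply(dfs, function(k) {
  glm(y ~ ns(sbp, df = k), data = d, family = binomial)
})

aic <- setNames(sapply(fits, AIC), paste0("df=", dfs))
print(round(aic, 1))

# Fitted logit at a few SBP values, one column per df setting
grid <- data.frame(sbp = c(100, 120, 140, 160))
curves <- sapply(fits, function(f) predict(f, newdata = grid, type = "link"))
dimnames(curves) <- list(paste0("sbp=", grid$sbp), names(aic))
print(round(curves, 2))
```

If the columns of `curves` tell the same story across df values, the shape is not an artifact of the smoothing choice; plotting them against SBP (for example with `matplot`) gives the visual version of the same check.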

Do you have a biologically plausible justification for an interaction, or are you just fishing for associations given the data?