Logistic Regression: collinearity btw continuous and categorical predictors?

gianmarco

TS Contributor
#1
Hi All,
I am still working on fitting a binary Logistic Regression model in which the DV is land 'optimal vs. non-optimal quality', and the IVs are both continuous (e.g., elevation, slope, distance from the coast, etc) and categorical (soil types [with 5 levels], geology [4 levels]).

In the last months I have had a good time reading a lot on LR. So far so good. What I am wondering now is whether the following situation can be considered a case of 'collinearity' between two predictors.
I have plotted notched boxplots of elevation by soil type, and it seems that there is a significant tendency for some soil types to have larger elevation values (here, I am basing this statement on a broad definition of the Kruskal-Wallis test). What I am concerned about is using soil types AND elevation as predictors. Can the described situation represent a 'collinearity' issue?
If it can, what would be the sounder approach: to retain just one of the two? Further, should I repeat the same 'screening' for all the other continuous IVs?

Best
Gm
 
#2
Hi All,
I have plotted notched boxplots of elevation by soil type, and it seems that there is a significant tendency for some soil types to have larger elevation values (here, I am basing this statement on a broad definition of the Kruskal-Wallis test). What I am concerned about is using soil types AND elevation as predictors. Can the described situation represent a 'collinearity' issue?
Hi Gian. Just throwing in my two cents. I guess it is a case of collinearity. Perhaps, besides the Kruskal-Wallis test, a bivariate correlation coefficient might be better for evaluation. And as you know better than I do, it is better to check the VIF. However, I think not all statistically significant correlations necessarily matter (a small correlation can become statistically significant in a large sample). The ones with a large effect size are the dangerous ones. So could you also elaborate on the VIF of the variables pertaining to soil types and the variable elevation, as well as the Spearman coefficients between soil types and elevation?


If it can, what would be the sounder approach: to retain just one of the two? Further, should I repeat the same 'screening' for all the other continuous IVs?
I think it would be best to always check for collinearity among all the variables involved, and then find and exclude the culprits. If you only assess the collinearity between 2 independent variables and exclude one of them, your model might still be affected by other hidden collinearity cases. I think the best method might be to model all the independent variables together and then assess their VIFs. Assessing the correlation matrix of all the independent variables is another good approach that can highlight potential culprits.
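To make the VIF check concrete, here is a minimal Python sketch (not SPSS or R output). It assumes the usual definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The data are simulated, and the names `terra_rossa`, `elevation` and `slope` are just placeholders for one soil dummy and two continuous IVs.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X (n x p)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Simulated (hypothetical) data: elevation is shifted upward for one soil type.
rng = np.random.default_rng(1)
n = 300
terra_rossa = rng.integers(0, 2, size=n)
elevation = 200 + 150 * terra_rossa + rng.normal(0, 40, size=n)
slope = rng.normal(10, 3, size=n)  # unrelated to the other two

vifs = vif(np.column_stack([elevation, terra_rossa, slope]))
for name, value in zip(["elevation", "terra_rossa", "slope"], vifs):
    print(name, round(float(value), 2))
```

In this made-up example the VIFs of elevation and the soil dummy come out clearly above that of slope, which is the pattern one would look for when hunting culprits.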

How to know which variable to keep? I don't know a definitive answer to this. What I know is that finding the optimum model might sometimes take up to one month. I mean, it is probably not that simple to say which variable to keep and which to exclude. In any case, you should try to keep the variable that is theoretically more important in your model. Moreover, you can add and remove variables and watch the -2 log likelihood: comparing the -2 log likelihoods of the two models shows whether the fit has improved or worsened by removing a variable. You can also use likelihood-ratio tests (LRT) to compare these -2 log likelihood values statistically (for nested models, the difference is referred to a chi-squared distribution with degrees of freedom equal to the number of parameters removed).

If both of those two highly correlated independent variables are theoretically essential, a suggestion might be to fit two similar regression models, each including one of those two independent variables and excluding the other.

This is a question of mine too and I would appreciate any update on the way to deal with such a situation (which variable to exclude?).
 

gianmarco

TS Contributor
#3
Thanks, Vict, for the detailed suggestions. Much appreciated.
As for the VIF, yes, I will surely take into account the VIF of each individual predictor. On top of that, before fitting the model, I checked the correlation between the IVs. This is quite easy for continuous vs continuous predictors (i.e., I used Pearson's r), or for categorical vs categorical predictors (i.e., I used chi-squared and Cramér's V as a measure of association). The problem was continuous vs categorical. On this point, what do you mean by:
a bivariate correlation coefficient might be better for evaluation
I am missing something here....

Also, can you provide an example in R of how to get the change in -2 log likelihood, and of LRT tests (which seem rather new to me)?

Thanks
Gm
 
#4
You are most welcome :eek:
Gian you are right, I didn't explain well. Sorry.

By the "variables pertaining to soil types" in the following sentence,

elaborate on the VIF of the variables pertaining to soil types and the variable elevation, as well as the Spearman coefficients between soil types and elevation?
I meant the dummy variables of soil.

This is because there is actually no variable "soil" but a collection of binary dummy variables, each named after a specific soil type. So the independent variables are not "soil" and "elevation". Instead, the real independent variables are the continuous variable "elevation" plus a number of binary variables "BrownRendz", "CarbonateRaws", "TerraRossa", etc. Each of these dummy variables (i.e., the soil types) is either zero or one. Therefore their correlation with the continuous variable "elevation" can be evaluated using VIF, Spearman, or other correlation coefficients.

I don't know about other software, but at least SPSS automatically creates these binary dummy variables when we hand it a categorical variable. However, in certain situations we need to dummy-code the variable ourselves, since SPSS doesn't do it. In your case, I think you should create 5 new binary variables for the 5 soil types and fill in zeros and ones in the cells accordingly. Then we can throw out the categorical variable "soil" and continue with the more convenient set of 5 binary variables we have created.
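As a rough illustration of the dummy coding described above, here is a minimal Python sketch (not SPSS). The soil names BrownRendz, CarbonateRaws and TerraRossa come from this thread; TypeD and TypeE are made-up placeholders for the remaining two levels, and the eight observations are invented. One caveat that is an addition to the post: when the dummies enter a regression that has an intercept, one of the five is normally left out as the reference category, because the five columns always sum to 1.

```python
# Hypothetical soil-type column; "TypeD" and "TypeE" are placeholder names.
soil = ["BrownRendz", "TerraRossa", "CarbonateRaws", "TypeD",
        "TypeE", "TerraRossa", "BrownRendz", "TypeD"]

levels = sorted(set(soil))  # the 5 distinct soil types

# One 0/1 column per soil type: 1 where the observation has that type.
dummies = {lvl: [1 if s == lvl else 0 for s in soil] for lvl in levels}

# For a model with an intercept, keep only 4 of the 5 columns;
# the dropped level becomes the reference category.
reference = levels[0]
model_dummies = {lvl: col for lvl, col in dummies.items() if lvl != reference}

for lvl in levels:
    print(lvl, dummies[lvl])
```

All five columns are fine for checking each dummy's correlation with elevation; the four-column set is the one that would go into the model.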

Hope it was more clear (and correct :D ) but if I am missing anything or had an error, please let me know. :)
 

gianmarco

TS Contributor
#5
Vict, your explanation is ok in both instances.
The only 'problem' I had was grasping the idea of a correlation between a continuous and a dummy-coded variable. What kind of correlation coefficient should I use? Spearman?
And what about the last part of my post:
Also, can you provide an example in R of how to get the change in -2 log likelihood, and of LRT tests (which seem rather new to me)?
Thanks
Gm
 
#6
I see Gian. Sorry for the bull**** response :D

Gian, the ideal measure for that correlation is the point-biserial correlation coefficient. Its formula is mathematically equivalent to that of the Pearson correlation coefficient when one variable is coded 0/1. So you can run a Pearson correlation between a binary and a continuous variable, and report the result as a point-biserial coefficient.
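The equivalence can be checked numerically. Below is a small Python sketch on simulated data (the variable names and numbers are invented). It computes Pearson's r on a 0/1 dummy directly, and then the textbook point-biserial formula, r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n²), where s_n is the standard deviation of the continuous variable computed with n in the denominator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a 0/1 dummy (one soil type) and a continuous variable.
dummy = rng.integers(0, 2, size=200)
elevation = 300 + 80 * dummy + rng.normal(0, 50, size=200)

# Pearson's r computed directly on the 0/1 dummy.
pearson_r = np.corrcoef(dummy, elevation)[0, 1]

# Point-biserial formula: difference in group means, scaled by the overall
# (population) standard deviation and the group proportions.
n = dummy.size
n1 = int(dummy.sum())
n0 = n - n1
m1 = elevation[dummy == 1].mean()
m0 = elevation[dummy == 0].mean()
s_n = elevation.std(ddof=0)
point_biserial_r = (m1 - m0) / s_n * np.sqrt(n1 * n0 / n**2)

print(pearson_r, point_biserial_r)
```

The two printed values agree up to floating-point rounding.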

But I have tested Spearman in similar situations and, honestly, both the P values and the coefficients of Spearman and point-biserial are very close most of the time, especially when the sample size is fairly large.

Regarding your second question, I am not familiar with R code. But these search results might help.

https://en.wikipedia.org/wiki/Likelihood-ratio_test
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q3/001175.html
http://www.rdocumentation.org/packages/difR/functions/LRT
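
For what it is worth, the change in -2 log likelihood and the likelihood-ratio test can be sketched as follows. This is Python rather than R, on simulated data, with a hand-written Newton-Raphson fit so that the block is self-contained; the names `fit_logit`, `elevation`, `slope` and `optimal` are illustrative only. In R, the same comparison is usually made by fitting the two nested models with `glm(..., family = binomial)` and calling `anova(reduced, full, test = "Chisq")`.

```python
import math
import numpy as np

def fit_logit(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson; return (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return beta, loglik

# Simulated (hypothetical) data with standardized predictors.
rng = np.random.default_rng(2)
n = 500
elevation = rng.normal(0, 1, size=n)
slope = rng.normal(0, 1, size=n)
eta = -0.3 + 1.0 * elevation + 0.5 * slope
optimal = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)

ones = np.ones(n)
X_full = np.column_stack([ones, elevation, slope])
X_reduced = np.column_stack([ones, slope])  # elevation removed

_, ll_full = fit_logit(X_full, optimal)
_, ll_reduced = fit_logit(X_reduced, optimal)

# Change in -2 log likelihood between the nested models = LR statistic.
lr_stat = (-2 * ll_reduced) - (-2 * ll_full)
df = X_full.shape[1] - X_reduced.shape[1]  # one parameter removed

# Chi-squared upper tail, valid for df = 1 only:
# P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(lr_stat, df, p_value)
```

A small p-value here says that dropping elevation makes the fit significantly worse. The `erfc` shortcut only covers the one-parameter case; removing a 5-level factor (4 dummies) would need the chi-squared tail with 4 degrees of freedom instead.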