# Thread: Logistic Regression: how can I handle IVs which are correlated?

1. ## Logistic Regression: how can I handle IVs which are correlated?

Hello,
I am planning to fit a binary Logistic Regression model, and I would like to use (among several IVs) 4 predictors that turn out to suffer from collinearity (VIF > 15). These IVs are the direct solar radiation (measured in kWh/sqm) received by the terrain in my study area across the 4 seasons.

By plotting them against one another, it is quite clear that they are correlated. So, I am wondering if there is a workaround that lets me keep them in the analysis, since it would be interesting to assess to what extent the different amounts of solar radiation in different seasons affect the DV.

Thank you.
Best
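
For readers who want to reproduce the VIF check mentioned above, here is a minimal sketch using NumPy only. The data are synthetic stand-ins for the four seasonal radiation layers (an assumption, not the poster's data), and the helper name `vif` is hypothetical. Each VIF is computed as 1 / (1 - R²) from regressing one predictor on the others.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example (assumption): four "seasons" sharing one common signal
rng = np.random.default_rng(0)
base = rng.normal(size=500)
seasons = np.column_stack([base + 0.2 * rng.normal(size=500) for _ in range(4)])

vifs = vif(seasons)
print(np.round(vifs, 1))
```

The same values can be read off the diagonal of the inverse correlation matrix, which is a handy cross-check.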

2. ## Re: Logistic Regression: how can I handle IVs which are correlated?

hi gianmarco,
this sounds like a clear case for running a PCA on the four variables first and then a logistic regression on the principal components. What do you think?
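
A rough sketch of that two-step idea, assuming synthetic data and NumPy only: PCA is done via SVD on the standardized seasonal variables, and the logistic regression is fitted with a plain Newton-Raphson loop. The variable names (`pc1`, `other_iv`) and the simulated effect sizes are illustrative assumptions, not anything from the poster's study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Synthetic, strongly correlated "seasonal radiation" predictors (assumption)
base = rng.normal(size=n)
X = np.column_stack([base + 0.2 * rng.normal(size=n) for _ in range(4)])
other_iv = rng.normal(size=n)  # some unrelated additional predictor

# Simulated binary outcome
true_logit = -0.3 + 1.0 * base + 0.5 * other_iv
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Step 1: PCA on the standardized seasonal variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)          # share of variance per component
w = Vt[0] * np.sign(Vt[0].sum())         # fix the arbitrary sign of PC1
pc1 = Z @ w                              # scores on the first component

# Step 2: logistic regression on intercept + PC1 score + other IV
D = np.column_stack([np.ones(n), pc1, other_iv])
beta = np.zeros(D.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ beta))
    grad = D.T @ (y - p)
    hess = D.T @ (D * (p * (1.0 - p))[:, None])
    step = np.linalg.solve(hess, grad)
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

print("variance explained by PC1:", round(float(explained[0]), 3))
print("coefficients (intercept, PC1, other_iv):", np.round(beta, 3))
```

In practice one would use a statistics package for the fit to get standard errors and tests; the loop here only shows what is being estimated.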


4. ## Re: Logistic Regression: how can I handle IVs which are correlated?

Thanks rogojel,

yes, I have read of this. Just for the record, I found these articles interesting:
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., … Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 027–046. http://doi.org/10.1111/j.1600-0587.2012.07348.x

Midi, H., Sarkar, S. K., & Rana, S. (2010). Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics, 13(3), 253–267. http://doi.org/10.1080/09720502.2010.10700699

PCA, which you refer to, is discussed in those publications: that's nice. I am familiar with PCA, but using its scores in a regression context is new to me.
I will look further into the matter.
Thanks
Gm

5. ## Re: Logistic Regression: how can I handle IVs which are correlated?

I totally agree with rogojel. In this case a data reduction technique such as PCA would be best. Simply calculate the PCA axis that best describes the 4 variables, and use the factor scores as a "radiation index". This is done all the time in big data.

Cheers,

TE

6. ## Re: Logistic Regression: how can I handle IVs which are correlated?

I have performed PCA, and it appears that the first dimension is actually explaining the great majority of the data variability (say, 91%). By the way, apologies if I use a terminology that is borrowed from Correspondence Analysis, but I hope that I manage to make my point.
So, I should use the scores on that dimension as an IV. That's fine, but how would it be interpreted in the context of LR? That is, assuming that the "new" IV proves to be a significant predictor, what would its interpretation be? In a sense, after PCA, I would lose the 'connection' between solar radiation and the time dimension represented by the season...

Gm

7. ## Re: Logistic Regression: how can I handle IVs which are correlated?

With 91%, it would seem that the first axis provides an ideal means of data compression, which also sidesteps the collinearity problem.

The interpretation would depend on your data. For instance, if I have two climate variables for, let's say, eastern Europe (monthly mean rain and temperature), then I can predict that these are likely to be highly correlated, with high rain values corresponding to low temperature values. I can't add both to a regression model, as this would lead to problems in the estimation and interpretation of the regression coefficients.

However, if I summarize those two variables into a "climate index" using their factor scores on the first PCA axis, then I can include them both via this index. The interpretation would also be straightforward: all I would need to do is look at the scatter plot of temp and rain and add the fitted PCA axis (or plot each variable against its factor score). In my example case I would expect to find that high scores correspond to high temperature and low rain, while low scores correspond to low temperature and high rainfall.

You just need to do this for your data compression axis.
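
As a small illustration of the "climate index" reading described above, the sketch below simulates temperature and rain with a negative relationship (made-up numbers, purely an assumption), builds the first PCA axis, and checks how each variable correlates with the scores. The sign of a PCA axis is arbitrary, so it is oriented here so that temperature loads positively.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic monthly means (assumption): warmer months are drier
temp = rng.normal(15, 5, size=300)
rain = 120 - 4 * temp + rng.normal(0, 8, size=300)

X = np.column_stack([temp, rain])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# First PCA axis from the SVD of the standardized data
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
w = Vt[0] if Vt[0, 0] > 0 else -Vt[0]   # orient so temperature loads positively
scores = Z @ w                           # the "climate index"

# Correlation of each original variable with the index
r_temp = np.corrcoef(temp, scores)[0, 1]
r_rain = np.corrcoef(rain, scores)[0, 1]
print("loadings (temp, rain):", np.round(w, 3))
print("corr(temp, index):", round(float(r_temp), 3))
print("corr(rain, index):", round(float(r_rain), 3))
```

With this orientation, high index values go with warm, dry conditions and low values with cool, wet ones, which is the reading to carry into the regression.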


9. ## Re: Logistic Regression: how can I handle IVs which are correlated?

hi,
quite often in an industrial context the first component is something quite close to an average (approximately equal weights). That would make the interpretation quite easy. Maybe it is something close to this in your case?

regards
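
One way to check this on your own data is to look at the first component's weights and compare its scores with a plain average of the standardized variables. A minimal sketch, assuming synthetic data in place of the real seasonal layers:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic, strongly correlated predictors (assumption)
base = rng.normal(size=500)
X = np.column_stack([base + 0.2 * rng.normal(size=500) for _ in range(4)])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

_, _, Vt = np.linalg.svd(Z, full_matrices=False)
w = Vt[0] * np.sign(Vt[0].sum())   # first-component weights, sign fixed
pc1 = Z @ w
avg = Z.mean(axis=1)               # simple average of standardized variables

r = np.corrcoef(pc1, avg)[0, 1]
print("PC1 weights:", np.round(w, 3))
print("corr(PC1 score, simple average):", round(float(r), 4))
```

If the weights are nearly equal and the correlation is close to 1, the component can be read as an overall (year-round) radiation level.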

10. ## Re: Logistic Regression: how can I handle IVs which are correlated?

I would think that part of this answer would depend on your theory of how the factor performs, which in turn would depend on how exactly you interpret the factor (this comes from factor analysis, but I assume PCA is very similar). The odds ratio (OR) of the factor would be interpreted exactly as you would interpret that of a raw variable.

One issue is how you created the new factor from the variables. Most commonly you add the levels of each variable, but you can also multiply them, and there are many other variations. One of these might make more sense from the perspective of the research you are doing (I don't know enough about these approaches to comment further, but it might be worth looking up).
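
To make the odds-ratio reading concrete, here is a short worked formula. It assumes the score is the first principal component of the standardized variables, i.e. a weighted sum; the symbols $w_j$ (component weights), $\bar{x}_j$ and $\sigma_j$ (mean and standard deviation of variable $j$), and $\beta_1$ (logistic coefficient of the score) are notation introduced here for illustration.

$$
s = \sum_{j=1}^{4} w_j \, \frac{x_j - \bar{x}_j}{\sigma_j},
\qquad
\log\frac{p}{1-p} = \beta_0 + \beta_1 s + \dots
$$

$$
\mathrm{OR}_{\text{per unit of } s} = e^{\beta_1},
\qquad
\mathrm{OR}_{\text{per unit of } x_j} = e^{\beta_1 w_j / \sigma_j}
$$

So a one-unit increase in the score multiplies the odds by $e^{\beta_1}$, and the implied effect of a single season, holding the other variables fixed, is obtained by scaling $\beta_1$ with that season's weight and standard deviation. With highly collinear seasons, holding the others fixed is largely hypothetical, so the per-score reading is usually the safer one.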
