Hello,
I am planning to fit a binary Logistic Regression model and I would like to use (among several IVs) 4 predictors that turn out to suffer from collinearity (VIF > 15). These IVs are the direct solar radiation (measured in kWh/sqm) received by the terrain in my study area across the 4 seasons.
By plotting them against one another, it is quite clear that they are correlated. So, I am wondering if there is a workaround that would keep them in the analysis, since it would be interesting to assess to what extent the different amounts of solar radiation in different seasons affect the DV.
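For readers who want to reproduce the VIF check mentioned above, here is a minimal sketch in Python with numpy. The data are synthetic stand-ins (four noisy copies of one shared signal), not the poster's radiation layers, and the helper name `vif` is made up for illustration. Each VIF is 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()            # R^2 of predictor j on the others
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic stand-in for four seasonal radiation variables (assumption, NOT real data)
rng = np.random.default_rng(0)
base = rng.normal(size=500)                          # shared terrain effect
seasons = np.column_stack([base + 0.2 * rng.normal(size=500) for _ in range(4)])

vifs = vif(seasons)
print(np.round(vifs, 1))
```

The same values can be read off the diagonal of the inverse correlation matrix, which is a handy cross-check.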
Thank you.
Best
http://cainarchaeology.weebly.com/
hi gianmarco,
this sounds like a clear case for a PCA on the four variables first and a logistic regression on the principal components afterwards. What do you think?
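A rough sketch of that two-step idea, in Python with numpy only. Everything here is an assumption for illustration: the data are simulated, the helper names `pca_scores` and `fit_logistic` are invented, and the logistic fit is a bare Newton-Raphson loop. In practice a statistics package would do both steps and also report standard errors.

```python
import numpy as np

def pca_scores(X):
    """Standardise the columns, then return (scores, loadings, explained_ratio)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Z @ Vt.T, Vt.T, explained

def fit_logistic(x, y, n_iter=25):
    """Maximum-likelihood logistic regression (intercept + slope) via Newton-Raphson."""
    A = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        W = p * (1.0 - p)
        grad = A.T @ (y - p)                 # score vector
        hess = A.T @ (A * W[:, None])        # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated stand-in data (assumption): four collinear predictors and a binary DV
rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=n)
seasons = np.column_stack([base + 0.2 * rng.normal(size=n) for _ in range(4)])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-base))).astype(float)

scores, loadings, explained = pca_scores(seasons)
if loadings[:, 0].sum() < 0:                 # PCA signs are arbitrary; orient PC1 upwards
    scores[:, 0] *= -1
    loadings[:, 0] *= -1

beta = fit_logistic(scores[:, 0], y)         # regress the DV on the PC1 scores only
print("share of variance on PC1:", round(explained[0], 3))
print("intercept, slope:", np.round(beta, 3))
```

The other IVs in the model would simply be added as further columns alongside the PC1 scores.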
Thanks rogojel,
yes, I have read of this. Just for the record, I found these articles interesting:
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., … Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 027–046. http://doi.org/10.1111/j.1600-0587.2012.07348.x
Midi, H., Sarkar, S. K., & Rana, S. (2010). Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics, 13(3), 253–267. http://doi.org/10.1080/09720502.2010.10700699
What you refer to (PCA) is referred to in those publications: that's nice. I am familiar with PCA, but the use of its scores in a regression context is something new to me.
I will look further into the matter.
Thanks
Gm
The true ideals of great philosophies always seem to get lost somewhere along the road..
Thanks for the reply Guys,
I have performed PCA, and it appears that the first dimension is actually explaining the great majority of the data variability (say, 91%). By the way, apologies if I use a terminology that is borrowed from Correspondence Analysis, but I hope that I manage to make my point.
So, I should use the scores on that dimension as an IV. That's fine, but what would the interpretation be in the context of LR? That is, assuming that the "new" IV proves to be a significant predictor, how should it be interpreted? In a sense, after PCA, I would lose the 'connection' between solar radiation and the time dimension represented by the season...
Gm
With 91% it would seem that the first axis provides an ideal means of data compression, which also reduces bias.
The interpretation would depend on your data. For instance, if I have two climate variables (for, let's say, eastern Europe), monthly mean rain and temperature, then I can predict that these are likely to be highly correlated, with high rain values corresponding to low temperature values. I can't add both to a regression model, as this would lead to problems in the estimation and interpretation of the regression coefficients.
However, if I summarize those two variables into a "climate index" using their factor scores on the first PCA axis, then I can include them both via this index. The interpretation would also be straightforward: all I would need to do is look at the scatter plot of temp and rain and add the fitted PCA axis (or plot each variable against its factor score). In my example case I would expect to find that high scores correspond to high temperature and low rain, while low scores correspond to low temperature and high rainfall.
You just need to do this for your data compression axis.
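The "climate index" reading described above can be checked numerically as well as by eye. The sketch below is an illustration on simulated temperature and rain values (assumed numbers, not real climate data): it builds the first PCA axis and then correlates each variable with the resulting score.

```python
import numpy as np

# Simulated monthly means (assumption): rain falls as temperature rises
rng = np.random.default_rng(2)
temp = rng.normal(10.0, 5.0, size=300)
rain = 80.0 - 3.0 * temp + rng.normal(0.0, 8.0, size=300)

X = np.column_stack([temp, rain])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardise before PCA
_, _, Vt = np.linalg.svd(Z, full_matrices=False)

pc1 = Vt[0]
if pc1[0] < 0:            # sign is arbitrary; orient the axis so temperature loads positively
    pc1 = -pc1
score = Z @ pc1           # the "climate index"

corr_temp = np.corrcoef(score, temp)[0, 1]
corr_rain = np.corrcoef(score, rain)[0, 1]
print("loadings (temp, rain):", np.round(pc1, 2))
print("corr(score, temp):", round(corr_temp, 2))
print("corr(score, rain):", round(corr_rain, 2))
```

With the axis oriented this way, a high score means warm and dry and a low score means cold and wet, which is the reading a scatter plot with the fitted axis would give.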
Last edited by TheEcologist; 07-24-2015 at 07:05 AM.
gianmarco (07-24-2015)
hi,
quite often in an industrial context the first component is something quite close to an average (approximately equal weights). That would make the interpretation quite easy. Maybe it is something close to this in your case?
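One quick way to see whether the first component is "close to an average" is to look at its weights and to correlate the PC1 score with the plain row mean of the standardised variables. A minimal sketch on simulated data (four equally correlated variables, an assumption made for illustration):

```python
import numpy as np

# Simulated stand-in (assumption): four variables driven by one common signal
rng = np.random.default_rng(3)
common = rng.normal(size=400)
X = np.column_stack([common + 0.2 * rng.normal(size=400) for _ in range(4)])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)

weights = Vt[0] if Vt[0].sum() > 0 else -Vt[0]   # fix the arbitrary sign
pc1_score = Z @ weights
simple_mean = Z.mean(axis=1)                     # unweighted average of the four variables

agreement = np.corrcoef(pc1_score, simple_mean)[0, 1]
print("PC1 weights:", np.round(weights, 3))
print("corr(PC1 score, simple mean):", round(agreement, 4))
```

With four variables, exactly equal weights would each be 0.5 (unit-length vector). If the real weights are close to that, the score can be read as overall radiation across the year.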
regards
I would think that part of this answer would depend on your theory of how the factor performs, which in turn would depend on how exactly you interpreted the factor (this comes from factor analysis, but I assume PCA is very similar). The OR of the factor would be interpreted exactly as you would interpret the OR of a raw variable.
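As a small worked example of that reading: the odds ratio is the exponential of the fitted coefficient. Both numbers below are hypothetical placeholders, not results from this thread. Because component scores have no natural unit, it can help to report the OR per standard deviation of the score as well as per unit.

```python
import math

beta = 0.45       # hypothetical coefficient of the PC1 score (log-odds per unit of score)
score_sd = 1.9    # hypothetical standard deviation of the PC1 scores

or_per_unit = math.exp(beta)            # odds multiply by this for a 1-unit higher score
or_per_sd = math.exp(beta * score_sd)   # odds multiply by this for a 1-SD higher score

print("OR per unit of score:", round(or_per_unit, 2))
print("OR per SD of score:", round(or_per_sd, 2))
```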
One issue is how you created the new factor from the variables. Most commonly you add up the levels of each variable, but you can also multiply them, and there are many other variations. One of these might make more sense from the perspective of the research you are doing (I don't know enough about these approaches to comment further, but it might be worth looking up).
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995