Thread: Logistic Regression: how can I handle IVs which are correlated?

  1. #1
    gianmarco

    Hello,
    I am planning to fit a binary logistic regression model and would like to use (among several IVs) 4 predictors that turn out to suffer from collinearity (VIF > 15). These IVs are the direct solar radiation (measured in kWh/sqm) received by the terrain in my study area in each of the 4 seasons.

    Plotting them against one another makes it quite clear that they are correlated. So I am wondering whether there is a workaround that would let me keep them in the analysis, since it would be interesting to assess to what extent the different amounts of solar radiation in different seasons affect the DV.
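
For readers who want to reproduce the collinearity check, here is a minimal sketch of a VIF computation. It assumes NumPy and uses made-up data (the seed, sample size and noise level are all hypothetical, not the study's values). It relies on the fact that the VIF of predictor j, 1 / (1 - R²_j), equals the j-th diagonal element of the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 locations, seasonal radiation driven by one shared
# terrain effect plus small season-specific noise (made-up numbers).
n = 200
terrain = rng.normal(size=n)
seasons = ["winter", "spring", "summer", "autumn"]
X = np.column_stack([terrain + 0.2 * rng.normal(size=n) for _ in seasons])

def vif(X):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

vifs = vif(X)
for name, v in zip(seasons, vifs):
    print(f"{name:>6}: VIF = {v:.1f}")
```

With predictors this strongly correlated, every VIF comes out well above the usual rule-of-thumb thresholds, which is the situation described above.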

    Thank you.
    Best
    http://cainarchaeology.weebly.com/

  2. #2
    rogojel

    hi gianmarco,
    this sounds like a clear case for a PCA on the four variables first and a logistic regression on the principal components afterwards. What do you think?
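
A rough sketch of that two-step idea, assuming NumPy only and entirely made-up data (the variable names, coefficients and the small Newton-Raphson fitter are illustrative assumptions; in practice a statistics package would do the fitting):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data (made-up): four collinear seasonal radiation variables
# and a binary outcome that depends on the overall radiation level.
n = 300
terrain = rng.normal(size=n)
X = np.column_stack([terrain + 0.2 * rng.normal(size=n) for _ in range(4)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.2 * terrain))))

# Step 1: PCA on the standardised predictors (i.e. on the correlation matrix).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigval)[::-1]            # largest eigenvalue first
eigval, eigvec = eigval[order], eigvec[:, order]
explained = eigval / eigval.sum()           # share of variance per component
pc1 = Z @ eigvec[:, 0]                      # scores on the first component

# Step 2: logistic regression of y on the PC1 scores via Newton-Raphson.
def fit_logistic(x, y, n_iter=25):
    A = np.column_stack([np.ones_like(x), x])   # intercept + one predictor
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-A @ beta))
        grad = A.T @ (y - p)
        hess = A.T @ (A * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

beta = fit_logistic(pc1, y)
print(f"variance explained by PC1: {explained[0]:.3f}")
print("intercept, slope on PC1:", beta.round(3))
```

Note that the sign of a principal component is arbitrary, so the sign of the fitted slope only has meaning once you have checked which direction of the component corresponds to high radiation.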

  3. The Following 2 Users Say Thank You to rogojel For This Useful Post:

    gianmarco (07-24-2015), hlsmith (07-24-2015)

  4. #3
    gianmarco

    Thanks rogojel,

    yes, I have read of this. Just for the record, I found these articles interesting:
    Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., … Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 027–046. http://doi.org/10.1111/j.1600-0587.2012.07348.x

    Midi, H., Sarkar, S. K., & Rana, S. (2010). Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics, 13(3), 253–267. http://doi.org/10.1080/09720502.2010.10700699

    The approach you mention (PCA) is covered in those publications, which is nice. I am familiar with PCA, but using its scores in a regression context is new to me.
    I will look further into the matter.
    Thanks
    Gm

  5. #4
    TheEcologist

    I totally agree with rogojel. In this case a data reduction technique such as PCA would be best. Simply calculate the PCA axis that best describes the 4 variables, and use the factor scores as a "radiation index". This is done all the time in big data.

    Cheers,

    TE
    The true ideals of great philosophies always seem to get lost somewhere along the road..

  6. #5
    gianmarco

    Thanks for the replies, guys.
    I have performed PCA, and it appears that the first dimension explains the great majority of the data variability (say, 91%). By the way, apologies if I use terminology borrowed from Correspondence Analysis, but I hope I manage to make my point.
    So, I should use the scores on that dimension as an IV. That's fine, but how would it be interpreted in the context of LR? That is, assuming that the "new" IV proves to be a significant predictor, what would its interpretation be? In a sense, after PCA, I would lose the 'connection' between solar radiation and the time dimension represented by the season...

    Gm

  7. #6
    TheEcologist

    With 91% it would seem that the first axis provides an ideal means of data compression, which also reduces bias.

    The interpretation would depend on your data. For instance, if I have two climate variables (for, let's say, eastern Europe), monthly mean rain and temperature, then I can predict that these are likely to be highly correlated, with high rain values corresponding to low temperature values. I can't add both to a regression model, as this would lead to problems in the estimation and interpretation of the regression coefficients.

    However, if I summarize those two variables into a "climate index" using their factor scores on the first PCA axis, then I can include them both via this index. The interpretation would also be straightforward: all I would need to do is look at the scatter plot of temp and rain and add the fitted PCA axis (or plot each variable against its factor score). In my example I would expect to find that high scores correspond to high temperature and low rain, while low scores correspond to low temperature and high rainfall.

    You just need to do this for your data compression axis.
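
The rain/temperature example can be sketched roughly as follows, assuming NumPy and made-up monthly values (all numbers are hypothetical). Instead of a scatter plot, it prints the correlation of each variable with the first-axis score, which carries the same information about how to read the index.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical monthly climate data (made-up numbers): wetter months are cooler.
n = 120
temp = rng.normal(10, 8, size=n)                      # degrees C
rain = 120 - 3 * temp + rng.normal(0, 10, size=n)     # mm
X = np.column_stack([temp, rain])

# PCA on standardised variables; keep the first axis as the "climate index".
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
axis1 = eigvec[:, np.argmax(eigval)]
if axis1[0] < 0:          # the sign of a PCA axis is arbitrary; fix it so
    axis1 = -axis1        # that high scores mean high temperature
index = Z @ axis1

# Correlation of each original variable with the index shows how to read it.
corr_with_index = [np.corrcoef(X[:, j], index)[0, 1] for j in range(2)]
for name, w, r in zip(["temp", "rain"], axis1, corr_with_index):
    print(f"{name}: weight = {w:+.2f}, correlation with index = {r:+.2f}")
```

A positive correlation for temperature and a negative one for rain means high index values stand for warm, dry months, which is the reading described in the example.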
    Last edited by TheEcologist; 07-24-2015 at 07:05 AM.

  8. The Following User Says Thank You to TheEcologist For This Useful Post:

    gianmarco (07-24-2015)

  9. #7
    rogojel

    hi,
    quite often in an industrial context the first component is something quite close to an average (approximately equal weights). That would make the interpretation quite easy. Maybe it is something close to this in your case?
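
One way to check this, sketched with NumPy on made-up data (not the actual radiation values): compare the first-component weights with equal weights, and correlate the component scores with the plain average of the standardised variables.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical seasonal radiation data (made-up), four collinear columns.
n = 200
terrain = rng.normal(size=n)
X = np.column_stack([terrain + 0.2 * rng.normal(size=n) for _ in range(4)])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
w = eigvec[:, np.argmax(eigval)]
w = w if w.sum() > 0 else -w          # axis sign is arbitrary; make weights positive

pc1 = Z @ w
average = Z.mean(axis=1)              # plain mean of the standardised seasons
corr_pc1_avg = np.corrcoef(pc1, average)[0, 1]

print("PC1 weights:", w.round(3))     # equal weights would be 0.5 each
print(f"corr(PC1, average): {corr_pc1_avg:.4f}")
```

If the weights are all close to 0.5 (that is, 1/sqrt(4) for four variables) and the correlation is close to 1, the first component can be read as "overall radiation across the year".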

    regards

  10. #8
    noetsi

    I would think that part of this answer depends on your theory of how the factor performs, which in turn depends on how exactly you interpret the factor (this comes from factor analysis, but I assume PCA is very similar). The OR (odds ratio) of the factor would be interpreted exactly as you would interpret that of a raw variable.
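
As a small worked example of reading that odds ratio (the slope and standard deviation below are hypothetical numbers, not results from this thread's data): exponentiating the logistic slope gives the multiplicative change in the odds per one-unit increase in the score, and scaling by the score's standard deviation gives the change per one-SD increase.

```python
import math

# Hypothetical fitted slope for the PC1 score (not from the thread's data).
beta_pc1 = 0.6
sd_pc1 = 1.9     # hypothetical standard deviation of the PC1 scores

or_per_unit = math.exp(beta_pc1)           # odds ratio per one-unit increase
or_per_sd = math.exp(beta_pc1 * sd_pc1)    # odds ratio per one-SD increase

print(f"OR per unit of score: {or_per_unit:.2f}")
print(f"OR per SD of score:   {or_per_sd:.2f}")
```

Because component scores have no natural unit, the per-SD version is often the easier one to report.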

    One issue is how you created the new factor from the variables. Most commonly you add the levels of each variable, but you can also multiply them, and there are many other variations as well. One of these might make more sense from the perspective of the research you are doing (I don't know enough about these approaches to comment further, but it might be worth looking up).
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
