Logistic regression and correlation

Jolene

New Member
Hi
I'm building an account management scorecard with logistic regression. Some of the variables are quite highly correlated, yet they get selected into the same model (so the correlation does not account for all the variance). According to Siddiqi (Credit Risk Scorecards), the effects of multicollinearity can be overcome by using a sufficiently large sample. My questions are:
1. Is this correct (i.e. can I ignore the correlations)?
2. How big can the correlation be to still be acceptable in the model?
3. How big is a sufficiently large sample?
Thanks a lot

Ninja say what!?!
Some questions for you:
1. what do you mean by "they get selected into the same model"? How are they being selected?
2. what do you mean by "the effect of the correlation does not explain all the variance"? PS. You'll almost never come across a model using real world data that "explains all the variance"
3. When you really are dealing with multicollinearity, yes you can overcome it with a sufficiently large sample.

1. Yes. However, whether you can ignore the correlations will also depend on a few other factors. I'll go into more detail once my questions are answered.
2. It depends. I wouldn't say there's a set size per se. However, anything close to 1 or -1 will raise lots of red flags.
3. Again, it depends. How big is the correlation and how big are the effect sizes?
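To put a rough number on question 2, one common diagnostic is the variance inflation factor (VIF). Here's a minimal sketch (in Python rather than SAS, and for the simple two-predictor case, where the R² from regressing one predictor on the other is just r²):

```python
# Sketch: how pairwise correlation r maps to the variance inflation
# factor (VIF) in the two-predictor case. With only two predictors,
# the R^2 from regressing one on the other is r^2, so VIF = 1/(1 - r^2).
# Function name is illustrative, not from the thread.

def vif_two_predictors(r):
    """VIF for one of two predictors with pairwise correlation r."""
    return 1.0 / (1.0 - r * r)

for r in (0.3, 0.5, 0.75, 0.9, 0.99):
    print(f"r = {r:.2f} -> VIF = {vif_two_predictors(r):.2f}")
```

A common rule of thumb (not from this thread) starts worrying when VIF exceeds 5 or 10; by that yardstick, a pairwise correlation of 0.9 is already at the edge, while 0.99 is clearly problematic.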

Jolene

New Member
1. I use PROC LOGISTIC in SAS with the selection=stepwise option to select variables.
2. Suppose variable 1 and variable 2 are highly correlated. If var 1 is already in a model and var 2 is entered into that same model, var 2 will probably look much less significant than var 1, since var 1 contains a large part of the information found in var 2. However, if var 2 nevertheless also shows high significance, then the correlation does not account for all the information contained in those two variables. Thus, value is added by including the second variable in the model. (I hope you understand what I am trying to say.)

I am working with correlations of about 0.75. Do you think that is too much?

antonitsin

New Member
Hi,
I think that's a big correlation if you're talking about partial correlation.

Can you quote your SAS proc for this? It's quite rare (if there are no lurking variables!) for variables with such a high partial correlation to both be significant. Please quote the p-values too.

Ninja say what!?!
Stepwise regression is very data adaptive. The resulting model's only as good as the data you have, regardless of your hypothesis and theory. Though this method sort of indirectly addresses multicollinearity, it does not take care of it.
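To illustrate how data-adaptive stepwise is, here's a toy simulation (in Python with made-up data, not SAS; all names are hypothetical). With two highly correlated predictors that matter equally, the one that enters the model first can flip from sample to sample:

```python
import random

# Toy illustration: with two highly correlated predictors that have
# identical true effects, the predictor a stepwise procedure enters
# first (here approximated as the one most correlated with y) can
# flip from sample to sample. This is why stepwise selection does
# not "take care of" multicollinearity.

random.seed(1)

def draw_sample(n=100, rho=0.9):
    """Draw n rows of (x1, x2, y) with corr(x1, x2) ~ rho."""
    rows = []
    for _ in range(n):
        z = random.gauss(0, 1)
        x1 = z
        x2 = rho * z + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
        y = x1 + x2 + random.gauss(0, 2)  # both predictors matter equally
        rows.append((x1, x2, y))
    return rows

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

first_pick = {"x1": 0, "x2": 0}
for _ in range(200):
    rows = draw_sample()
    x1 = [r[0] for r in rows]
    x2 = [r[1] for r in rows]
    y = [r[2] for r in rows]
    # the first stepwise entry is the predictor most correlated with y
    pick = "x1" if abs(corr(x1, y)) >= abs(corr(x2, y)) else "x2"
    first_pick[pick] += 1

print(first_pick)  # each predictor wins a sizable share of the runs
```

Neither variable is "the" right first pick; which one stepwise grabs depends on noise in that particular sample.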

I agree with your sentence: ...if var 2 also shows a high significance [in addition to var 1], then that correlation does not explain all the effects contained in those two variables. Thus, value is added by including the second variable into the model.

Working with a correlation of about 0.75 is complicated. In my view, it's high enough that you may see problems with the regression, but low enough that you really have to weigh whether you want to drop one of the variables.
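As a hedged back-of-the-envelope on Siddiqi's "sufficiently large sample" point: in a linear-model approximation, a coefficient's standard error scales roughly like sqrt(VIF / n), so a correlation of 0.75 between two predictors inflates standard errors by about 50%, and you'd need roughly 2.3x the observations to recover the precision you'd have with uncorrelated predictors:

```python
import math

# Back-of-the-envelope (linear-model approximation, not exact for
# logistic regression): standard errors scale roughly as sqrt(VIF / n).
# With r = 0.75 between two predictors, VIF = 1 / (1 - 0.75**2).

r = 0.75
vif = 1.0 / (1.0 - r ** 2)      # ~2.29
se_inflation = math.sqrt(vif)   # ~1.51: SEs roughly 51% wider
sample_multiplier = vif         # sample factor needed to restore precision

print(f"VIF: {vif:.2f}")
print(f"SE inflation: {se_inflation:.2f}x")
print(f"Sample needed: {sample_multiplier:.2f}x")
```

So a big enough sample really can compensate, in the sense that the inflated standard errors shrink back down; it just costs you more data, and the coefficients on the two variables remain hard to interpret separately.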