Logistic Regression Problem


My colleague is developing a logistic regression model to predict the probability of a customer taking up a particular credit card.

Let's call the credit card the gold card, so his dependent variable on the left side is GoldCard = 0,1. On the right side of his equation he is also including an overall credit card indicator that equals 1 if the customer has ANY credit card. Therefore, for any observation where GoldCard = 1, the overall credit card indicator also equals 1.

It seems to me that this would result in artificially high accuracy because the two variables are perfectly correlated. A comparable example would be developing a model that predicts gender and then including a gender variable on the right side. And you'd never do that, right?

Anyways, I'm pretty sure it's incorrect to include the overall card indicator but I'm not knowledgeable or articulate enough to explain why. Any info/insight would be greatly appreciated.



No cake for spunky
The two variables won't be perfectly correlated. That would only be true if there were only one credit card available. Many of the zeros on the left side will be ones on the right side (that is, on the predictor variable); these will reflect people who had a credit card other than the gold card. If the variables were perfectly correlated, then there would be no valid regression equation and many software packages won't run (this is perfect collinearity).
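To make the point concrete, here is a minimal sketch using ten made-up customers (the data and the variable names `any_card` and `gold_card` are purely illustrative, not the real campaign data). Every gold card holder also has the overall indicator set to 1, yet the correlation stays well below 1 because some customers hold a different card. It only reaches 1 in the hypothetical one-card case, where the two columns are identical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)

# Made-up data: 6 of 10 customers hold some card, 2 of those hold the gold card.
any_card  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
gold_card = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# GoldCard = 1 implies AnyCard = 1, but the reverse does not hold.
r_partial = pearson(any_card, gold_card)

# Hypothetical one-card world: the indicator is the same column as the outcome.
r_single = pearson(gold_card, gold_card)

print(round(r_partial, 3))  # roughly 0.41 for this toy data
print(round(r_single, 3))   # 1.0, the perfect-collinearity case
```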

I would agree that it might artificially inflate the prediction (it might also reduce variability, which causes attenuation of the slope in extreme cases). But the real question, I think, is not the methods issue; it is what the theoretical reason is to include this variable. Obviously, if people chose to have a credit card rather than not having one, this would increase the chance of having one specific card. But what does that tell you? It is sort of like predicting eating by using whether one is hungry as a predictor. It might predict it, but you have learned very little.

Thanks for the input, noetsi.

You're right that it's not perfect collinearity. We're backtesting the model on previous campaigns right now, so it'll be interesting to see if it predicts enough of the actual responses, or if it only predicts the small percentage of customers in the campaign who already had a card and got the gold card as their second card.


No cake for spunky
I would be interested in hearing what you find, including any methods issues. I have spent a lot of time working with logistic regression in the context of SAS, but I rarely get to use it or see it applied to practical issues.

One thing of note. You probably will get faster responses here if you put this type of thread in the regression rather than probability forum.

I ran the model against a previous campaign and it did a terrible job of discerning take-up. It had essentially the same response rate in the 0.00 - 0.10 score range as in the 0.90 - 1.00 range.
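For anyone wanting to run the same kind of check, here is a rough sketch of comparing take-up rates across score ranges. The function name `take_up_by_band` and the eight scored customers are made up for illustration; they are not the actual campaign results. A model that discriminates well should show a clearly higher rate in the top band than in the bottom one.

```python
def take_up_by_band(scores, responded, n_bands=10):
    """Return (band_low, band_high, n_customers, take_up_rate) per score band."""
    counts = [0] * n_bands
    takers = [0] * n_bands
    for score, took in zip(scores, responded):
        # A score of exactly 1.0 is placed in the top band.
        band = min(int(score * n_bands), n_bands - 1)
        counts[band] += 1
        takers[band] += took
    rows = []
    for b in range(n_bands):
        # Empty bands get a rate of None rather than a misleading 0.
        rate = takers[b] / counts[b] if counts[b] else None
        rows.append((b / n_bands, (b + 1) / n_bands, counts[b], rate))
    return rows

# Made-up backtest: model scores and actual take-up (1 = took the card).
scores    = [0.05, 0.08, 0.03, 0.07, 0.95, 0.92, 0.97, 0.99]
responded = [0,    1,    0,    0,    0,    1,    0,    0]

rows = take_up_by_band(scores, responded)
for low, high, n, rate in rows:
    if n:
        print(f"{low:.2f} - {high:.2f}: n={n}, take-up rate={rate:.2f}")
```

In this toy data the bottom and top bands have the same take-up rate, which is the pattern described above: the score is not separating responders from non-responders.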


TS Contributor
I think, if I understood the case correctly, that it only makes sense to look at customers with the value 1 in the ANY credit card column. I mean, even a turtle can predict that if a customer has NO credit card at all then he does not have a Gold card either.

But in this case you get a column of 1s which will not help in the prediction at all.
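A small sketch of that situation, with made-up customer records (the field names `any_card` and `gold_card` are hypothetical): once the data is restricted to card holders, the indicator has zero variance, so it carries no information for telling gold card holders apart from the rest.

```python
from statistics import pvariance

# Made-up customers; the first four hold some card, the last two hold none.
customers = [
    {"any_card": 1, "gold_card": 1},
    {"any_card": 1, "gold_card": 0},
    {"any_card": 1, "gold_card": 0},
    {"any_card": 1, "gold_card": 1},
    {"any_card": 0, "gold_card": 0},
    {"any_card": 0, "gold_card": 0},
]

# Customers without any card can never hold the gold card.
non_holders_gold = [c["gold_card"] for c in customers if c["any_card"] == 0]

# Restrict to card holders: the indicator becomes a column of 1s.
holders = [c for c in customers if c["any_card"] == 1]
indicator = [c["any_card"] for c in holders]
indicator_variance = pvariance(indicator)

print(indicator)           # every value is 1
print(indicator_variance)  # zero variance: nothing left to predict with
```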

Am I missing something here?