Hi!
My colleague is developing a logistic regression to predict the probability of a customer taking up a particular credit card.
Let's call the credit card the gold card, so his dependent variable on the left side is GoldCard = 0 or 1. On the right side of his equation he is also including an overall credit card indicator that equals 1 if the customer has ANY credit card. Therefore, for any observation where GoldCard=1, the overall credit card indicator also equals 1.
It seems to me that this would result in artificially high accuracy because the indicator is partly determined by the outcome: it is always 1 whenever GoldCard=1. A comparable (if more extreme) example would be developing a model that predicts gender and then including a gender variable on the right side. And you'd never do that, right?
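To make the concern concrete, here is a rough sketch with simulated data (not my colleague's actual data). The take-up rates and the names `P_ANY_CARD`, `P_GOLD_GIVEN_ANY`, `simulate` and `crosstab` are made up purely for illustration. It just cross-tabulates the overall card indicator against GoldCard:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical rates, chosen only for illustration (not real data).
P_ANY_CARD = 0.6         # share of customers holding any credit card
P_GOLD_GIVEN_ANY = 0.3   # share of cardholders who hold the gold card


def simulate(n):
    """Return n (any_card, gold_card) pairs as 0/1 values."""
    rows = []
    for _ in range(n):
        any_card = 1 if random.random() < P_ANY_CARD else 0
        # A gold card is itself a credit card, so GoldCard=1 forces AnyCard=1.
        gold_card = 1 if any_card and random.random() < P_GOLD_GIVEN_ANY else 0
        rows.append((any_card, gold_card))
    return rows


def crosstab(rows):
    """Count observations in each (any_card, gold_card) cell."""
    counts = {(a, g): 0 for a in (0, 1) for g in (0, 1)}
    for a, g in rows:
        counts[(a, g)] += 1
    return counts


rows = simulate(10_000)
counts = crosstab(rows)
for (a, g), c in sorted(counts.items()):
    print(f"AnyCard={a} GoldCard={g}: {c}")
```

However the rates are set, the AnyCard=0, GoldCard=1 cell is empty by construction, so the indicator alone settles the outcome for every customer without a card.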
Anyway, I'm pretty sure it's incorrect to include the overall card indicator, but I'm not knowledgeable or articulate enough to explain why. Any info or insight would be greatly appreciated.
Thanks,
Ben