# Thread: Logistic Regression - Predicted Probabilities with unknown variables

1. ## Logistic Regression - Predicted Probabilities with unknown variables

Hi all,

I'm wondering about how to calculate predicted probabilities using a binary logistic regression model when one or more of the variables in the model is unknown.

For example, say I built a model to determine whether or not someone will own a car which uses the following:
- Age, coded 0 = 25 years or older, 1 = under 25
- Gender, coded 0 = male, 1 = female
- Home proximity to work, coded categorically: 0 = less than 10 km, 1 = 10-30 km, 2 = over 30 km
- Household income, coded 0 = \$0-\$30,000, 1 = \$30,000-\$60,000, 2 = over \$60,000

Now, suppose someone asks me to calculate the predicted probability of someone owning a car given their age, gender, and income, but without specifying home proximity. Can this be done? Can I use my model to predict this outcome while holding home proximity constant?

Would I also be able to do a predicted probability when more than one variable is missing (say home proximity and gender)?

2. ## Re: Logistic Regression - Predicted Probabilities with unknown variables

Unfortunately, no. You need values for all of the predictors to generate a prediction. Missing values are sometimes filled in with the mean or median, but that only makes sense for continuous variables. If you absolutely must predict something, try substituting the mode estimated from a validation data set.
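A minimal sketch of that mode-fill idea in Python. The coefficients, the validation data, and the new case are all made up for illustration; nothing here comes from a real fitted model:

```python
from collections import Counter
from math import exp

# Hypothetical fitted coefficients (intercept plus one weight per predictor).
coef = {"intercept": -0.5, "age": 0.8, "gender": -0.3, "proximity": 0.6, "income": 0.4}

def predict_proba(x):
    """Predicted probability of car ownership from the (hypothetical) logistic model."""
    z = coef["intercept"] + sum(coef[k] * v for k, v in x.items())
    return 1 / (1 + exp(-z))

# Proximity is missing for the new case, so fill it with the mode
# observed in a (hypothetical) validation set.
validation_proximity = [0, 1, 1, 2, 1, 0, 1]
mode_proximity = Counter(validation_proximity).most_common(1)[0][0]

new_case = {"age": 1, "gender": 0, "income": 2, "proximity": mode_proximity}
p = predict_proba(new_case)
```

Note that this treats the mode as if it were an observed value, so the resulting probability will look more certain than it really is.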

3. ## Re: Logistic Regression - Predicted Probabilities with unknown variables

The model was built with complete variable data. Logistic regression software usually employs listwise deletion: if a value is missing for any variable, that whole subject is discarded. So if you compute a predicted probability and simply leave the missing variable out of your formula, as above, that variable is effectively set to its reference value, even though you omitted it from the equation. There are two key concepts here: interpolation and extrapolation. Interpolation is estimating a value that would have fallen within the range of your data, say age 55 when your dataset contained subjects aged 53 and 56; you are assuming the relationship is smooth enough that predicting at 55 is reasonable. Extrapolation is predicting for values outside your data, say age 89 when your highest observed age was 56; that is dangerous because 89 is far outside the scope of your data. Your problem is neither of these, since you don't know the age at all. (Sorry, I am just using age as an example.)
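The point about omitted terms defaulting to the reference level can be seen directly in the linear predictor. A sketch with made-up coefficients, where proximity enters as two dummy variables and proximity = 0 (under 10 km) is the reference category:

```python
from math import exp

# Hypothetical coefficients: intercept plus dummy terms for the two
# non-reference proximity levels (reference is proximity = 0).
b0, b_age, b_gender, b_income, b_prox1, b_prox2 = -1.0, 0.7, -0.2, 0.3, 0.5, 0.9

def proba(age, gender, income, prox1=0, prox2=0):
    z = (b0 + b_age * age + b_gender * gender + b_income * income
         + b_prox1 * prox1 + b_prox2 * prox2)
    return 1 / (1 + exp(-z))

# "Leaving proximity out" of the formula...
p_omitted = proba(age=1, gender=1, income=2)
# ...is numerically identical to setting it to the reference category:
p_reference = proba(age=1, gender=1, income=2, prox1=0, prox2=0)
```

The two calls produce the same number, which is why a prediction with the term silently dropped is really a prediction for someone living under 10 km from work.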

A wonky option is to refit the model without the missing variable, i.e., drop that term from the model entirely, and use the reduced model to get an estimate. It may be interesting to see whether the coefficients of the variables you kept change much, which tells you whether they are roughly independent of the dropped variable. If they don't change much, that is a good sign, and you could then get your estimate. However, you cannot compare that estimate with, or reference it against, the other estimates, since they come from two different models.
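A sketch of that refit-and-compare idea, using a plain gradient-descent logistic fit on synthetic data (the data-generating coefficients and the fitting routine are both made up for illustration; real work would use a statistics package). Because proximity is generated independently of the other predictors here, the retained coefficients should stay similar between the full and reduced fits:

```python
import random
from math import exp

random.seed(0)

def sigmoid(z):
    return 1 / (1 + exp(-z))

def fit_logistic(X, y, lr=0.5, steps=1500):
    """Plain gradient-descent logistic regression; returns [intercept, weights...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(y) for wj, g in zip(w, grad)]
    return w

# Synthetic data: proximity is drawn independently of the other predictors.
rows = []
for _ in range(300):
    age, gender, income = random.randint(0, 1), random.randint(0, 1), random.randint(0, 2)
    prox = random.randint(0, 2)
    z = -0.5 + 0.8 * age - 0.3 * gender + 0.4 * income + 0.5 * prox
    rows.append((age, gender, income, prox, 1 if random.random() < sigmoid(z) else 0))

y = [r[-1] for r in rows]
w_full = fit_logistic([[a, g, i, p] for a, g, i, p, _ in rows], y)
w_reduced = fit_logistic([[a, g, i] for a, g, i, _, _ in rows], y)

# Compare the retained coefficients (e.g. age) between the two fits; the
# intercept absorbs the average effect of the dropped proximity term.
```

Even here the reduced-model coefficients are mildly attenuated toward zero (logistic regression is non-collapsible), which is part of why the two models' estimates should not be compared directly.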

Another option is to impute the missing values, run the full model on each imputed data set, and pool your results. That way you statistically assign plausible values to the missing data and can use the pooled estimates in an equation that does include the term in the model.
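A toy version of that imputation idea for a single new case: draw plausible proximity values from the (hypothetical) training distribution, predict under each draw with the full model, and average. This averages on the probability scale for simplicity; proper multiple imputation pools with Rubin's rules, which also combine the variances. All coefficients and frequencies below are invented:

```python
import random
from math import exp

random.seed(1)

# Hypothetical full-model coefficients (proximity entered as two dummies,
# reference = under 10 km).
b0, b_age, b_gender, b_inc, b_p1, b_p2 = -1.0, 0.7, -0.2, 0.3, 0.5, 0.9

def proba(age, gender, income, prox):
    z = b0 + b_age * age + b_gender * gender + b_inc * income
    z += b_p1 * (prox == 1) + b_p2 * (prox == 2)
    return 1 / (1 + exp(-z))

# Empirical distribution of proximity in the (hypothetical) training data:
# 40% under 10 km, 45% 10-30 km, 15% over 30 km.
observed_prox = [0] * 40 + [1] * 45 + [2] * 15

# m imputations: draw a plausible proximity, predict, then pool by averaging.
m = 50
draws = [random.choice(observed_prox) for _ in range(m)]
pooled = sum(proba(1, 0, 2, d) for d in draws) / m
```

The pooled probability lands between the predictions you would get at the individual proximity levels, weighted by how common each level is, which is a more honest answer than silently assuming the reference category.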
