How can you deal with categorical variables that aren't present in your training set?

Let's say I have the following linear regression:

earnings = c0 + age * c1 + male * c2

where "male" is a 0/1 dummy variable and "female" is the reference (baseline) category.

Let's say my training set was 75% male and 25% female.

Let's also say:

c0 = 20,000
c1 = 1,000
c2 = 1 <- bear with me... for the sake of illustration, I assume gender is not a large driver of earnings.

Let's also assume this model perfectly describes the training set with R^2 = 1.

Ok, so all is fine and dandy... but then a new gender appears... let's say transgender becomes a new category in my gender data. My model has only been fit on male and female.

Under this scenario, is it really appropriate to say you flat out can't predict earnings for the new data point?

I mean... wouldn't it be reasonable enough to just wing it and say something like c3 (the coefficient for the transgender category) is 0.75? That is the average of the existing gender effects weighted by their training-set shares: 0.75 * 1 (male) + 0.25 * 0 (female) = 0.75. Kind of like a no-op when using the new category in f(). I'd be saying "I'm not really sure, so I'll just mute the effect".
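To make the arithmetic concrete, here is a minimal Python sketch of that idea. It uses the coefficients and training shares from the example above; the names `predict_earnings` and `FALLBACK_EFFECT` are made up for illustration, and the fallback rule itself is only the assumption being asked about, not an established method.

```python
# Coefficients from the example above.
C0 = 20_000.0  # intercept
C1 = 1_000.0   # earnings per year of age
C2 = 1.0       # effect of "male" relative to the "female" baseline

# Gender effects as the fitted model sees them (female is the baseline, so 0).
EFFECT = {"male": C2, "female": 0.0}

# Training-set shares from the example (75% male, 25% female), used as weights.
TRAIN_SHARE = {"male": 0.75, "female": 0.25}

# ASSUMPTION: an unseen category gets the share-weighted average effect.
FALLBACK_EFFECT = sum(TRAIN_SHARE[g] * EFFECT[g] for g in TRAIN_SHARE)


def predict_earnings(age, gender):
    """Predict earnings; unseen genders fall back to the averaged effect."""
    effect = EFFECT.get(gender, FALLBACK_EFFECT)
    return C0 + C1 * age + effect


print(FALLBACK_EFFECT)                      # 0.75
print(predict_earnings(30, "male"))         # 50001.0
print(predict_earnings(30, "transgender"))  # 50000.75
```

Note that this amounts to predicting as if the new data point had the training set's average gender mix, which is itself an assumption about the unseen category rather than the absence of one.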

Does this kind of technique exist (or any technique so you don't have to just toss out the row)? If so, what is it called?


TS Contributor
Re: How can you deal with categorical variables that aren't present in your training

Do you believe that transgender persons are between male and female? I would say that, depending on the society they live in, they face severe discrimination and their effect on income would not be somewhere in between men and women but something more negative. This is exactly the problem with the technique you propose: it is far from neutral but makes pretty extreme assumptions.

As long as your third category has no effect on income, you can use your old model to predict the income of men and women (but not of transgender people). If the transgender category does influence income (a likely outcome), then you will just have to start all over again and train a new model on data that includes it.