Including rates as predictor variables


Active Member
Let's say I'm building a model to predict whether an individual voted for candidate A or candidate B in an election. I have zip code as one of the predictor variables (thousands of distinct values). Obviously, including zip code as a categorical field with that many levels will not work. But if I compute the rate of votes for candidate A within each zip code, I can use that rate as a numeric variable on [0,1]. Someone suggested this approach, which I had never thought of before. Has anyone used this approach when dealing with many categories? Thanks.
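
For concreteness, here is a rough sketch of what I mean by a per-zip rate. The data, the `zip_rates` and `encode` names, and the fallback to the overall rate for unseen zip codes are all made up for illustration. I'm assuming the rates would be computed from the training rows only, so the outcome of the rows being predicted doesn't leak into the predictor.

```python
from collections import defaultdict

def zip_rates(rows):
    """Return ({zip_code: share of rows voting 'A'}, overall share voting 'A')."""
    counts = defaultdict(lambda: [0, 0])  # zip -> [votes for A, total votes]
    for zip_code, vote in rows:
        counts[zip_code][0] += vote == "A"
        counts[zip_code][1] += 1
    rates = {z: a / n for z, (a, n) in counts.items()}
    total_a = sum(a for a, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    return rates, total_a / total_n

# Hypothetical training data: (zip code, candidate voted for)
train = [
    ("10001", "A"), ("10001", "B"), ("10001", "A"),
    ("60601", "B"), ("60601", "B"),
    ("94105", "A"),
]
rates, overall = zip_rates(train)

def encode(zip_code):
    """Numeric predictor on [0, 1]; unseen zip codes get the overall rate (an assumption)."""
    return rates.get(zip_code, overall)

print(encode("10001"), encode("99999"))
```

Zip codes with only a handful of voters would get very noisy rates this way, so some shrinkage toward the overall rate might be worth considering.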


Less is more. Stay pure. Stay poor.
Well, that would resolve the reference group issue. I have used zip code to get census-style data (median home, etc.) before.

Do you think you're fine ignoring spatial relationships, since rates will covary? You could use distance to the nearest urban center to get at the spatial part too.
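
A minimal sketch of the distance idea, assuming you have a latitude/longitude centroid for each zip code and a list of urban centers. The coordinates, the `URBAN_CENTERS` table, and the function names below are illustrative only. It uses the haversine great-circle formula with a mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical urban centers: name -> (latitude, longitude)
URBAN_CENTERS = {
    "NYC": (40.7128, -74.0060),
    "Chicago": (41.8781, -87.6298),
}

def nearest_center(lat, lon):
    """Return (name, distance in km) of the closest urban center."""
    return min(
        ((name, haversine_km(lat, lon, clat, clon)) for name, (clat, clon) in URBAN_CENTERS.items()),
        key=lambda pair: pair[1],
    )

# Example: a zip centroid near Philadelphia (approximate coordinates)
name, dist_km = nearest_center(39.9526, -75.1652)
print(name, round(dist_km, 1))
```

The distance could then go into the model as another numeric predictor next to the zip-level rate.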

Seems fine to me! Did you see those geology maps today about how a 100-million-year-old landscape helped influence the 2020 election? Fun stuff!