Continuous vs categorical variable in regression

#1
Dear all,

When doing regression analysis, I am sometimes unsure whether I should keep a predictor continuous (e.g., age in years) or regroup it into a categorical variable (e.g., five age groups: 18 to 24, 25 to 34, 35 to 44, 45 to 54, and 55+). What is the best way to decide?

Any comments or references are welcome. Many thanks.

James
 

Karabiner

TS Contributor
#2
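The grouping is artificial, and it will cost you 4 degrees of freedom (one for each of the four dummy variables needed to code five groups) instead of only one degree of freedom for a single linear age term.
So it seems wise not to group. What would you achieve by it?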

With kind regards

Karabiner
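
To make the degrees-of-freedom count in post #2 concrete, here is a minimal sketch in plain Python. It only builds design-matrix rows and counts columns; the helper names (age_band, continuous_row, categorical_row) and the sample ages are made up for illustration, and dummy coding with the first band as the reference level is an assumption, not something stated in the thread.

```python
# Age bands from the original question: 18-24, 25-34, 35-44, 45-54, 55+
BAND_LOWER_BOUNDS = [18, 25, 35, 45, 55]


def age_band(age):
    """Return the index (0-4) of the band that an age falls into."""
    index = 0
    for i, lower in enumerate(BAND_LOWER_BOUNDS):
        if age >= lower:
            index = i
    return index


def continuous_row(age):
    """Design-matrix row for continuous age: [intercept, age]."""
    return [1.0, float(age)]


def categorical_row(age):
    """Design-matrix row with dummy coding; the first band is the reference."""
    band = age_band(age)
    dummies = [1.0 if band == k else 0.0 for k in range(1, len(BAND_LOWER_BOUNDS))]
    return [1.0] + dummies


ages = [19, 30, 41, 50, 62]  # illustrative values only

# Parameters spent on age, not counting the intercept.
continuous_df = len(continuous_row(ages[0])) - 1
categorical_df = len(categorical_row(ages[0])) - 1

print("continuous age uses", continuous_df, "degree(s) of freedom")
print("five age bands use", categorical_df, "degrees of freedom")
```

The same count applies to any k-level grouping: k - 1 dummy columns against a single slope for the continuous version.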
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
I agree with @Karabiner: splits are generally arbitrary. If there really is a biological, policy, or social phenomenon at play, you should look at those effects using splines (generalized additive models [GAMs]) or polynomials. Splitting the variable in a piecewise regression could be acceptable in some scenarios, but probably only after visualizing the relationship with a GAM and validating the split points in a holdout dataset. Piecewise splits also typically assume linear relationships before and after the inflection points if polynomials or GAMs are not used, so they may misrepresent your data as well unless they are justified by theory or, at a bare minimum, validated empirically.
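
As a rough illustration of the holdout check suggested in post #3, the sketch below compares a straight line, a quadratic polynomial, and a five-band grouping on held-out data. Everything in it is an assumption made for the example: the data are synthetic with a deliberately curved age effect, the split is a simple 300/100 cut, and a quadratic polynomial stands in for a GAM or spline to keep the dependencies down to NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: a curved age effect plus noise.
age = rng.uniform(18, 80, size=400)
outcome = 0.02 * (age - 45) ** 2 + rng.normal(0, 1, size=400)

# Simple holdout split: first 300 rows to fit, last 100 to validate.
train, test = slice(0, 300), slice(300, 400)


def mse(predicted, observed):
    """Mean squared prediction error."""
    return float(np.mean((predicted - observed) ** 2))


results = {}

# Polynomial fits of degree 1 (straight line) and 2 (quadratic).
for name, degree in [("linear", 1), ("quadratic", 2)]:
    coefs = np.polyfit(age[train], outcome[train], deg=degree)
    results[name] = mse(np.polyval(coefs, age[test]), outcome[test])

# Banded fit: predict the training mean of each age band.
# Edges give the bands [18, 25), [25, 35), [35, 45), [45, 55), [55, 80].
edges = np.array([25, 35, 45, 55])
train_band = np.digitize(age[train], edges)
test_band = np.digitize(age[test], edges)
band_means = np.array(
    [outcome[train][train_band == b].mean() for b in range(5)]
)
results["banded"] = mse(band_means[test_band], outcome[test])

for name, error in results.items():
    print(f"{name:>9}: holdout MSE = {error:.2f}")
```

On real data the ranking could come out differently, which is the point of validating on a holdout set rather than assuming any one functional form.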
 

noetsi

Fortran must die
#4
I have always heard that you lose information (I assume degrees of freedom) when you categorize interval variables. Having said that, the federal agency I am doing an analysis for split age into 7 variables, which left me very puzzled.
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
One way you lose information is that you are treating the top value in one group as different from the bottom value in the next group, when those two values could be practically identical on any given day.
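
The boundary problem described in post #5 can be seen in a small sketch. It assumes noise-free, hypothetical data in which the outcome rises by exactly one unit per year of age, and it relies on the fact that a regression with only the band dummies as predictors fits the mean of each band. The helper names are made up for illustration.

```python
# Age bands from the original question: 18-24, 25-34, 35-44, 45-54, 55+
BAND_LOWER_BOUNDS = [18, 25, 35, 45, 55]


def age_band(age):
    """Return the index (0-4) of the band that an age falls into."""
    index = 0
    for i, lower in enumerate(BAND_LOWER_BOUNDS):
        if age >= lower:
            index = i
    return index


# Hypothetical noise-free data: outcome rises by 1 unit per year of age.
ages = list(range(18, 65))
outcomes = [float(a) for a in ages]

# Group the outcomes by band and take the mean of each band.
by_band = {}
for a, y in zip(ages, outcomes):
    by_band.setdefault(age_band(a), []).append(y)
band_means = {band: sum(ys) / len(ys) for band, ys in by_band.items()}


def predict(age):
    """Prediction from the banded model: the mean of the age's band."""
    return band_means[age_band(age)]


# Ages 24 and 25 are one year apart but sit in different bands.
gap_across_boundary = predict(25) - predict(24)
# Ages 25 and 34 are nine years apart but share a band.
gap_within_band = predict(34) - predict(25)

print("predicted gap, 24 vs 25:", gap_across_boundary)
print("predicted gap, 25 vs 34:", gap_within_band)
```

In this toy data the true gaps are 1 and 9, so the banded model overstates the difference between neighbours on either side of a cut point and erases the difference between people at opposite ends of the same band.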