Converting a continuous variable into a categorical variable (low, medium, high)

#1
I have a continuous dependent variable (scores on an exam) and a number of predictor variables that I'd like to use in a regression analysis, such as gender, age, and income. I'd like to convert age and income into categorical variables: low, medium, and high. I'd like to do this using percentiles, but am not sure if I should use tertiles (lower 33% = low, middle 33% = medium, upper 33% = high) or if I should divide the data into lower 25% (low), middle 50% (medium), and upper 25% (high).

Thoughts? Thank you!
 

Karabiner

TS Contributor
#2
I'd like to convert age and income into categorical variables: low, medium, and high.
Why? What for? You'll throw away statistical information and you'll
create artificial groups, which could be meaningless. Usually, the
interval-scaled variable works perfectly well as a predictor as it is.
I'd like to do this using percentiles, but am not sure if I should use tertiles (lower 33% = low, middle 33% = medium, upper 33% = high) or if I should divide the data into lower 25% (low), middle 50% (medium), and upper 25% (high).
So you want to use sample data in order to define your groups.
Your definition will then be sample-specific. To what could the results
be generalized? The next study with the next sample, or the population,
will have different 33% (etc.) limits.

Moreover, if most participants are poor, or most of them have medium
income, or most of them are wealthy, you'll label people as
low/medium/high who aren't.
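
Both points can be checked with a small simulation. This is only a sketch: the lognormal "income" population, the sample size of 200, and the helper name `tertile_cutoffs` are assumptions made for illustration, not anything from the original poster's data.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of right-skewed incomes (lognormal); illustrative only.
population = [rng.lognormvariate(10.8, 0.6) for _ in range(100_000)]
population_median = statistics.median(population)


def tertile_cutoffs(values):
    """Return the two cut points that split `values` into three equal-sized groups."""
    return statistics.quantiles(values, n=3)


# Two random samples from the same population give different cutoffs.
cuts_a = tertile_cutoffs(rng.sample(population, 200))
cuts_b = tertile_cutoffs(rng.sample(population, 200))

# A sample drawn only from the poorer half of the population.
poorer_half = [x for x in population if x < population_median]
cuts_poor = tertile_cutoffs(rng.sample(poorer_half, 200))

print("sample A cutoffs:", [round(c) for c in cuts_a])
print("sample B cutoffs:", [round(c) for c in cuts_b])
print("poorer-half cutoffs:", [round(c) for c in cuts_poor])
print("population median:", round(population_median))
```

In the poorer-half sample, everyone above the upper cutoff gets labelled "high" even though all of them earn less than the population median.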

With kind regards

K.
 

hlsmith

Omega Contributor
#3
As long as the relationship between the continuous variable and the outcome is roughly linear, ideally you never categorize it. Categorizing loses information, among other problems.


The only exception I can think of is when you are not disseminating results, the analysis is purely for in-house use, and your sample is pretty complete. But if you are looking to share your results, it can be difficult to generalize them to other samples or populations if the cut rules were developed using only your own sample.
 
#4
Thanks for your advice. The main reason I wanted to categorize the data is that the incomes are estimated using zip codes/Census data, so they’re not perfect. I also found the B values from my regression to seem more meaningful, because rather than a fraction of a point increase in score per dollar income, there was a larger change in score by category. Please let me know if this changes your opinion or not.

Also, if there wasn’t a linear relationship between the variables and it was desirable to convert the continuous variable into categories, would you use tertiles or 25-50-25?

Thanks!
 

hlsmith

Omega Contributor
#5
If the variable is not naturally continuous, then conversion may be more acceptable.


Plotting the relationship between the variables is important for understanding linearity. Options include scattergraphs, loess curves, and generalized additive models (splines). Tertiles may be dangerous to use in these situations; you first want to determine where the changes in slope occur (knots), and sometimes simple piecewise regression or data transformations (logging or polynomials) are good choices. But this assumes there isn't a monotonic relationship and that not accounting for the non-monotonicity would be inappropriate. If there is a linear relationship, proceeding with the continuous variable is fine, but if you have a sinusoidal or, say, quadratic relationship going on, you need to address it.
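
As a rough illustration of the piecewise idea, here is a minimal sketch using NumPy. Everything in it is assumed for the example: the simulated income/score data, the true knot at 60, the candidate knot grid, and the helper name `fit_piecewise`. It fits a single-knot linear spline by least squares and picks the knot with the smallest residual sum of squares. In practice you would look at the plots first and probably use a dedicated spline or GAM routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: score rises with income (in thousands) up to 60, then flattens.
income = rng.uniform(20, 120, size=500)
true_knot = 60.0
score = 40 + 0.6 * np.minimum(income, true_knot) + rng.normal(0, 3, size=500)


def fit_piecewise(x, y, knot):
    """Least-squares fit of y = b0 + b1*x + b2*max(x - knot, 0).

    Returns the coefficient vector and the residual sum of squares.
    """
    X = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ coef) ** 2))
    return coef, sse


# Grid search: try each candidate knot and keep the best-fitting one.
candidates = np.arange(30.0, 111.0, 1.0)
best_knot = min(candidates, key=lambda k: fit_piecewise(income, score, k)[1])

coef, _ = fit_piecewise(income, score, best_knot)
slope_below = coef[1]            # slope before the knot
slope_above = coef[1] + coef[2]  # slope after the knot

print("estimated knot:", best_knot)
print("slope below knot:", round(float(slope_below), 3))
print("slope above knot:", round(float(slope_above), 3))
```

Unlike tertiles, the cut point here is chosen by where the slope actually changes, and the variable stays continuous on either side of it.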