Discretization of a continuous variable

#1
Hi guys,

I'm analyzing a study of patients and their risk factors for developing a certain side effect.
If I enter age as a continuous variable into forward stepwise logistic regression, sex and the patient's diagnosis have significant ORs, but age is not included.
However, if I say I believe children will have more side effects, and discretize age into a nominal variable called kids (<16 vs. >16 years), then enter this variable instead of age into the stepwise logistic regression, it is significant and has an amazing OR.

How can this be? I could have chosen any random age and tried it out...
What is the real result?

A detailed explanation of this difference would be much appreciated.
Amir.
 

hlsmith

Omega Contributor
#2
You need to try and plot the relationship. Perhaps it is not a linear or monotonic relationship.
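
Before plotting, a quick way to see the shape is to tabulate the event rate by age band. A minimal Python sketch, assuming made-up ages and outcomes (not the poster's data); the function name `rate_by_band` and the 10-year band width are illustrative choices:

```python
from collections import defaultdict

def rate_by_band(ages, outcomes, width=10):
    """Return {band_start: event rate} for age bands of the given width."""
    totals = defaultdict(lambda: [0, 0])  # band start -> [events, patients]
    for age, y in zip(ages, outcomes):
        band = (age // width) * width
        totals[band][0] += y
        totals[band][1] += 1
    return {band: ev / n for band, (ev, n) in sorted(totals.items())}

# Made-up example data: 1 = side effect, 0 = none
ages = [3, 7, 12, 14, 25, 28, 33, 37, 45, 48]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

rates = rate_by_band(ages, outcomes)
for band, rate in rates.items():
    print(f"{band}-{band + 9}: {rate:.2f}")
```

If the rates do not rise or fall steadily across the bands, a single linear age term is unlikely to describe the relationship well.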

I typically use the trade off between sensitivity and specificity to determine the best cut off for a continuous variable, but first you would want to understand the relationship between the variables.

You can also look at the OR for an increase of more than 1 unit in age. That should not affect its significance, but I believe it changes the size of the OR.
 
#3
So you're suggesting a ROC curve? Well, I usually do that in cases where the continuous variable is indeed significant in the regression...
I didn't understand the OR>1 remark: SPSS didn't give an OR for age, since it's not significant and didn't enter the analysis in the stepwise method.
 

noetsi

Fortran must die
#5
I don't know why this is occurring. I am not a big fan of stepwise, but it is almost always a bad idea to convert a continuous variable into a categorical one. You lose information in the process.

I think what hlsmith is suggesting is that rather than looking at the OR for a 1-year increase in age, you look at the OR for a five- or ten-year increase in age, or whatever your specific units are.
 
#6
How do I do it in SPSS?
 

hlsmith

Omega Contributor
#7
Not an SPSS user, but if it is not apparent - one method might be to round all values (e.g., nearest 5 year increment).

I was just trying to say, ignoring your stepwise approach, that you could use the tradeoff between sensitivity (SEN) and specificity (SPEC) to find a cutoff if you go that route, since you seem to be asking how to find the best way to discretize your data. Yes, the accuracy value from the ROC curve could be used as well.
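
One common way to turn the sensitivity/specificity tradeoff into a single cutoff is Youden's J (sensitivity + specificity - 1). This is a swapped-in illustration, not necessarily the exact criterion meant above. A minimal Python sketch on made-up data; `best_cutoff` and the candidate ages are hypothetical, and "positive" is assumed to mean younger than the cutoff:

```python
def best_cutoff(values, outcomes, candidates):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    A patient counts as 'test positive' when value < cutoff.
    Ties keep the first (lowest-listed) candidate.
    """
    best = None
    for c in candidates:
        tp = sum(1 for v, y in zip(values, outcomes) if v < c and y == 1)
        fn = sum(1 for v, y in zip(values, outcomes) if v >= c and y == 1)
        tn = sum(1 for v, y in zip(values, outcomes) if v >= c and y == 0)
        fp = sum(1 for v, y in zip(values, outcomes) if v < c and y == 0)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]

# Made-up example data: 1 = side effect, 0 = none
ages = [4, 8, 12, 15, 20, 30, 40, 50, 60, 70]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

cutoff, sens, spec = best_cutoff(ages, outcomes, candidates=[10, 16, 18, 30, 50])
print(cutoff, round(sens, 3), round(spec, 3))
```

A cutoff picked this way is tuned to the sample at hand, so the OR and p-value of the resulting binary variable tend to look better than they would on new data. That is the concern behind "I could have chosen any random age".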
 

noetsi

Fortran must die
#8
SAS has a specific way to change the unit and I am sure SPSS does as well (although I don't work with it).

If all else fails, you can divide age by, say, 5 or 10 and import that into SPSS. Five years, or ten, would then be a one-unit change.
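
The rescaling can also be done by hand from the per-year output, since the OR for a k-year increase is exp(k·B). A minimal Python sketch; the coefficient and standard error below are hypothetical values, and `rescale_or` is an illustrative helper, not an SPSS or SAS function:

```python
import math

def rescale_or(b_per_unit, se_per_unit, k, z=1.96):
    """OR and approximate 95% Wald CI for a k-unit increase in the predictor."""
    b_k = k * b_per_unit      # the coefficient scales linearly with the unit
    se_k = k * se_per_unit    # and so does its standard error
    return (math.exp(b_k),
            math.exp(b_k - z * se_k),
            math.exp(b_k + z * se_k))

b, se = 0.02, 0.01  # hypothetical per-year coefficient and standard error

or1, lo1, hi1 = rescale_or(b, se, 1)
or10, lo10, hi10 = rescale_or(b, se, 10)

# The Wald z statistic is the same on either scale, so the p-value does not change
z_per_year = b / se
z_per_decade = (10 * b) / (10 * se)

print(f"per year:   OR={or1:.3f} [{lo1:.3f}-{hi1:.3f}]")
print(f"per decade: OR={or10:.3f} [{lo10:.3f}-{hi10:.3f}]")
```

Only the size of the OR changes with the unit; whether age is significant does not.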
 

CowboyBear

Super Moderator
#9
If I enter age as a continuous variable into forward stepwise logistic regression
Stop right there. Stepwise regression is pretty much always a bad idea (See here, here, and here).

To quote Andrew Gelman (last link above):
"Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once."
 
#11
I don't think stepwise is the issue here :)
I tried using ENTER instead of stepwise.
Age:
B=-0.003 p=0.443 OR=0.997 [0.989-1.005]

This leads to a very different conclusion than dividing age into children and adults...
 

maartenbuis

TS Contributor
#12
The answer was given before, see #2: the effect of age is non-linear. Going from 1 to 2 years is something different from going from 14 to 15 years, or 42 to 43 years, or 91 to 92 years. If you add age linearly to your model, you assume that all these 1-year increments have the same effect. Unsurprisingly, that is almost never true.

Also, you should expect a large difference in the coefficient/odds ratio because the units of your variables are radically different: age compares people 1 year apart, while kids compares kids with adults. You obviously cannot compare these results directly.
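
To make the non-linearity point concrete, here is a minimal Python sketch on made-up grouped data. Everything in it is an assumption for illustration: the U-shaped risk pattern (high in the youngest and oldest groups, low in between) is just one shape that can produce a flat linear term alongside a large kids-vs-adults OR, and `fit_logistic` is a bare-bones Newton-Raphson fit, not what SPSS runs internally.

```python
import math

def fit_logistic(xs, ns, events, iters=25):
    """Fit logit(p) = a + b*x to grouped binomial data by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0  # score vector and information matrix
        for x, n, y in zip(xs, ns, events):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = n * p * (1.0 - p)
            g0 += y - n * p
            g1 += (y - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Made-up grouped data: 100 patients per age, U-shaped event counts
ages = [5, 15, 25, 35, 45, 55, 65, 75, 85]
ns = [100] * 9
events = [40, 40, 10, 10, 10, 10, 10, 40, 40]

# Age as a linear term
a, b = fit_logistic(ages, ns, events)
or_per_year = math.exp(b)

# Age dichotomized at 16: a plain 2x2 odds ratio
kid_ev = sum(e for x, e in zip(ages, events) if x < 16)
kid_n = sum(n for x, n in zip(ages, ns) if x < 16)
adult_ev = sum(events) - kid_ev
adult_n = sum(ns) - kid_n
or_kids = (kid_ev / (kid_n - kid_ev)) / (adult_ev / (adult_n - adult_ev))

print(f"OR per year of age: {or_per_year:.3f}")
print(f"OR kids vs adults:  {or_kids:.3f}")
```

In this constructed example the per-year OR comes out at essentially 1, while the kids-vs-adults OR is about 2.9. The two models are answering different questions about the same data.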
 

CowboyBear

Super Moderator
#13
I don't think stepwise is the issue here :)
I tried using ENTER instead of stepwise.
Age:
B=-0.003 p=0.443 OR=0.997 [0.989-1.005]

This will lead to a very different conclusion when dividing the Age to children and adults...
I'm not saying that the use of stepwise regression is to blame for age having a non-linear effect. I'm just saying that you would be better off selecting your model using a method other than stepwise regression.
 

hlsmith

Omega Contributor
#15
Yeah, it is a process. It took me 8 hours to build a model the other day (testing assumptions, interactions, random effects, model fit, parsimony selection). If I had just run an automated process, it would not have known the relationships in those data or the content.