Issue defining ordinal variables in SAS

#1
Hello All,

I am developing a logistic regression model in SAS, and one of the independent variables is age. I do not have exact ages, only age buckets: 26 to 30 years is bucket 1, 31 to 35 years is bucket 2, 36 to 40 years is bucket 3, and so on. If I include these bucket numbers in the model, SAS treats them as a continuous numeric variable. I want SAS to understand that these buckets are ordinal. I could not find any useful information about this online.

If I treat this ordinal data as numeric, is that a big mistake?

Can you please help me out?

Thanks a lot in advance :)

Regards,

Raghu
 

hlsmith

Not a robit
#2
This variable is an independent variable, correct?

Just put it in the class statement and it will be treated as a categorical variable. You should not have to, but you may need to set your reference group. You can write a contrast statement to see whether it has a linear effect, i.e. whether the odds go up or down as the age group goes up or down.
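
As a rough illustration of what "treated as a categorical variable" means, here is a minimal Python sketch (not SAS) of reference-cell dummy coding. The function name and the sample bucket values are made up for illustration. Note that this is only one possible parameterization: the SAS CLASS statement has a PARAM= option, and the coding it applies by default may differ from the one shown here.

```python
def reference_dummies(values, reference):
    """Reference-cell coding: one 0/1 column per non-reference level."""
    levels = sorted(set(values))
    non_ref = [lv for lv in levels if lv != reference]
    rows = [[1 if v == lv else 0 for lv in non_ref] for v in values]
    return non_ref, rows


# Hypothetical age-bucket values for six people.
age_bucket = [1, 2, 3, 2, 1, 3]
columns, design = reference_dummies(age_bucket, reference=1)

print(columns)  # levels that received an indicator column
for row in design:
    print(row)  # the reference level (bucket 1) is all zeros
```

Each non-reference bucket gets its own coefficient, so the model does not assume equal spacing between buckets. It also does not use their order, which is what a trend contrast is for.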
 
#3
Hello Smith,
Thanks a lot for your response!
Yes, the variable is an independent variable.
Smith, I don't want the age bucket numbers to be treated as a categorical variable. As we know, an ordinal variable carries a little more information, in the sense that its values (levels) follow a particular order. For example, suppose I have departments coded 1 for HR, 2 for Marketing, 3 for Finance, etc. Treating department as a categorical variable makes sense because there is no ordering among departments, i.e. we can't say HR > Marketing or Finance > HR.
In my case the age bucket numbers do have an ordering: a person in bucket 3 is definitely older than a person in bucket 1 or 2. We can't quantify the relation, but we are sure of the order. By treating the age buckets as categorical, I would lose the information about the ordering of the buckets, which I do not want.
Can you please clarify?

Thanks a lot in advance!

Best Regards,
Raghu
 

noetsi

Fortran must die
#4
Ordinal variables are categorical. I think you mean you don't want it to be treated as a nominal variable, which is also a categorical variable.

I don't know which method you are using, but if it's regression or ANOVA you are going to have to use dummy variables anyhow. And as far as I know it does not matter whether a dummy variable comes from an ordinal or a nominal variable. Your argument that you don't want to lose information would make more sense if your variable were the DV.

It is generally not a good idea to have an IV be a non-dummy ordinal variable, because SAS is going to assume it is interval in nature and because that can distort the interpretation of the slopes.
 

hlsmith

Not a robit
#5
As far as I know, IVs go in as continuous or categorical in SAS regression procedures. So you enter it in the class statement, and if you want to see whether there is a linear trend, you create a contrast like the one in the linked note below. I believe this generalizes to logistic regression, but if it doesn't work, let me know. Also, if you don't have four categories, you will have to alter the statement.


In the full model, you will lose a little information, as you mentioned. The other option would be to leave it out of the class statement, which seems like a worse option.


http://support.sas.com/kb/22/912.html
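
For a different number of categories, the linear-trend coefficients for equally spaced levels can be generated rather than typed by hand. The Python sketch below is only an illustration with a made-up function name. It assumes equally spaced buckets and produces the usual centered integer scores (for four levels: -3, -1, 1, 3). How these coefficients map onto an actual CONTRAST statement depends on the CLASS parameterization in use, so check the result against the linked note.

```python
from functools import reduce
from math import gcd


def linear_trend_coefficients(k):
    """Centered integer scores for k equally spaced, ordered levels."""
    if k < 2:
        raise ValueError("need at least two levels")
    raw = [2 * i - (k + 1) for i in range(1, k + 1)]  # sums to zero
    divisor = reduce(gcd, (abs(c) for c in raw))      # reduce to lowest terms
    return [c // divisor for c in raw]


print(linear_trend_coefficients(4))   # [-3, -1, 1, 3]
print(linear_trend_coefficients(12))  # twelve 5-year age buckets
```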
 
#6
@Noetsi Thanks for your suggestions!
@Smith The link you shared seems interesting. I have 12 age buckets of 5 years each. I will go through the concept and get back to you.

Thanks once again! :)
 

noetsi

Fortran must die
#7
According to many texts, it is not supposed to matter what form the IV takes. You will see in the literature a Likert-scale variable with, say, 50 levels used as a 50-level IV. But at least according to a very smart professor I had, the texts are wrong. Using a Likert variable this way (really any way but dummies, I assume) will create interpretation problems and may lead to nonsensical results.
 

hlsmith

Not a robit
#8
I can't remember your context, but would the Cochran-Armitage trend test work for you? Or do you have covariates to control for?
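
For reference, the Cochran-Armitage statistic needs only the event counts and totals per bucket, which is also why it cannot adjust for covariates. Below is a minimal Python sketch of the usual large-sample form with bucket numbers 1, 2, 3, ... as scores. The counts are invented for illustration and the function name is hypothetical.

```python
from math import erfc, sqrt


def cochran_armitage(events, totals, scores=None):
    """Return (z, two-sided p) for a trend in proportions across ordered groups."""
    if scores is None:
        scores = list(range(1, len(events) + 1))  # assumed equally spaced
    n_all = sum(totals)
    p_bar = sum(events) / n_all                   # pooled event proportion
    t = sum(s * (r - n * p_bar) for s, r, n in zip(scores, events, totals))
    sum_ns = sum(n * s for s, n in zip(scores, totals))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, totals))
    var_t = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_all)
    z = t / sqrt(var_t)
    p_value = erfc(abs(z) / sqrt(2))              # two-sided normal p-value
    return z, p_value


# Hypothetical data: events and group sizes for four age buckets.
z, p = cochran_armitage(events=[30, 25, 20, 15], totals=[100, 100, 100, 100])
print(round(z, 3), round(p, 4))
```

A negative z here means the event proportion falls as the bucket number rises.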
 
#9
Hello Noetsi/Smith,
After discussing this with a few more people, I have decided to create a binary variable at a desired cutoff. Say my cutoff age is 30 years: I would assign '0' to everyone aged 30 or younger and '1' to everyone above 30. I will then check the significance of this variable in my model.
Similarly, I will create a few more variables at cutoff ages of 40 years, 50 years, etc. I will enter only one of these at a time in the model and check for significance. If all of the cutoff variables are significant, I will choose the one that reduces AIC/BIC the most.
The downside of this approach is that I cannot use the full information available in the age buckets (the percentage of events decreases across buckets), since I am reducing it to a binary variable.
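
To make the comparison concrete, here is a small Python sketch of the idea for the simplest case only: a logistic model containing nothing but an intercept and one cutoff indicator. In that special case the fitted probabilities are just the event proportions on each side of the cutoff, so the log-likelihood and AIC can be computed from bucket counts directly. The counts and the function name are made up for illustration; with other covariates in the model the fit would have to come from the software.

```python
from math import log


def binary_cutoff_aic(events, totals, last_low_bucket):
    """AIC of an intercept-plus-indicator logistic model.

    Buckets 1..last_low_bucket are coded 0, the remaining buckets 1.
    """
    def group_loglik(lo, hi):
        r, n = sum(events[lo:hi]), sum(totals[lo:hi])
        if 0 < r < n:
            p = r / n  # maximum-likelihood estimate for this group
            return r * log(p) + (n - r) * log(1 - p)
        return 0.0     # all events or no events: likelihood is 1

    loglik = group_loglik(0, last_low_bucket) + group_loglik(last_low_bucket, len(events))
    return 2 * 2 - 2 * loglik  # two parameters: intercept and indicator


# Hypothetical counts for six 5-year buckets (26-30, 31-35, ..., 51-55).
events = [30, 25, 20, 15, 12, 10]
totals = [100, 100, 100, 100, 100, 100]
for cutoff_age, last_low in [(30, 1), (40, 3), (50, 5)]:
    print(cutoff_age, round(binary_cutoff_aic(events, totals, last_low), 2))
```

Lower AIC is better, and all three models here have the same number of parameters, so the ranking is really a ranking by log-likelihood.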

Let me have your comments.

Thanks,
Raghu
 

noetsi

Fortran must die
#10
Another downside is that you are supposed to have a theoretical basis for choosing a specific cutoff (that is, which variable to use). Trying them one by one and choosing the one that gives the lowest AIC is a doubtful way to pick a variable. Among other problems, it might increase the chance of making a Type I error. There is always a chance, reflected by alpha, of saying something is significant when it is not, or of saying a model is significant when it is not. I believe these chances increase as you try one variable after another in your model that way.

But this is a grey area so you should ask others what they think.
 
#11
I think you are making too much out of this. Do you have reason to believe that the relationship between age and the outcome is nonlinear?


Traditionally, a person would use a receiver operating characteristic (ROC) curve to find the cutoff that optimizes classification of the outcome. There is some literature on correcting for false discovery, but most analyses do not. Another way to examine cutoffs is a classification tree. Lastly, if you just want to control for the variable and are not as interested in interpreting age's effect on the outcome, you could incorporate a spline(age) term into the model. This may require using a generalized additive model (GAM).
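
As an illustration of the ROC idea with bucketed age, the Python sketch below scans each possible cutoff, computes sensitivity and specificity from bucket counts, and picks the cutoff with the largest Youden index (sensitivity + specificity - 1). Youden's index is just one common criterion and is my assumption here, as are the invented counts and the rule "predict an event when the bucket is at or below the cutoff", which fits an event rate that falls with age.

```python
def best_youden_cutoff(events, totals):
    """Return (cutoff_bucket, J) maximizing Youden's J over bucket cutoffs."""
    non_events = [n - r for r, n in zip(events, totals)]
    total_events, total_non = sum(events), sum(non_events)
    best = None
    for c in range(1, len(events)):                 # cutoff after bucket c
        sensitivity = sum(events[:c]) / total_events    # events predicted as events
        specificity = sum(non_events[c:]) / total_non   # non-events predicted as non-events
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (c, j)
    return best


# Hypothetical counts for four age buckets.
cutoff, j = best_youden_cutoff(events=[30, 25, 20, 15], totals=[100, 100, 100, 100])
print(cutoff, round(j, 4))
```

The cutoff chosen this way is tuned to the same data it is evaluated on, so it is subject to the false-discovery concern mentioned above.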
 

noetsi

Fortran must die
#12
If you think time is having an impact on the DV and is continuous, just creating a variable that counts (starting at 1 for a given year and going up by one every year after that) should work. If you think the effect is not linear, you could square these values and add the result to the model as a quadratic term, which is often done for regression models that use time.

If instead you think that there is a certain age below which the impact varies, it might make more sense to speak of a cutoff, although you could just as easily model this as a count of years since birth. The cutoffs tend to be artificial, I suspect, chosen because readers prefer to analyze buckets rather than one year at a time. There is also a real issue with structural breaks, given changes over time. In the sixties, people under 30 might have been more liberal than those over 30. Now the reverse might be true. If you are going to do this type of analysis, you really should do a Chow test for structural breaks.