number of variables for a model

#1
I am now trying to build a Cox model: h(t, beta) = h0(t)exp(X*beta)

I have selected some variables from the data. For a categorical variable with N categories, I set up N-1 dummy variables. Counted that way, there are 26 variables in my model. Is that too many for one model?

But if I treat each categorical variable as a single factor, the number of variables is about 15. That seems OK for a linear or logistic regression, but I am not sure whether it is OK for a Cox model, since I have never used one before.

Can anybody give me some advice?
Many thanks!
 
#2
depending on which software package you're using, the dummy variables may be made automatically from factors

whether that's too many parameters largely depends on total sample size, among other things like the sample size within each stratum defined by the factor levels, and the rarity of the event being modeled
 
#3
Thanks, the advice is helpful; I will check the data further as you suggest.
The total sample size is about 40 million, so that should not be a problem, and I will check the sample size within each stratum. The event rate is about 10% of the total sample.
 
#5
:eek::eek::eek:

40 million records
4 million events

yeah, that's big as far as analysis data sets go...


... you gotta tell us, please describe the situation where you have 40M records


(is this in public health??!! :D)
Thanks for your interest. It is not public health data, just American loan data. I am trying to use the Cox model to predict the status of a loan, i.e. the transition from "Delinquency" to "Liquidation": "Delinquency -> Delinquency" means survival and "Delinquency -> Liquidation" means death. Events make up about 10% of the total sample.

Are 20+ variables feasible for a Cox model with this sample size, or is that too many, so that I am just overfitting the data? One opinion is that with N samples, the number of variables should not exceed N/5; if so, my number of variables is far below N/5. I just want to know how many variables people often use in a Cox model so I can get a general understanding.
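
To make that arithmetic concrete, here is a rough Python sketch using the figures from this thread (40 million records, about 4 million events, 26 parameters). The function names are made up for illustration, and the N/5 bound is only the opinion quoted above, not an established threshold.

```python
def max_params_n_over_5(n_samples):
    """Upper bound on parameters under the N/5 opinion quoted in the post."""
    return n_samples // 5


def events_per_parameter(n_events, n_params):
    """Events available per estimated coefficient."""
    return n_events / n_params


# Figures taken from the thread; 4 million is roughly 10% of 40 million.
n_records = 40_000_000
n_events = 4_000_000
n_params = 26

bound = max_params_n_over_5(n_records)
epp = events_per_parameter(n_events, n_params)

print(f"N/5 bound on parameters: {bound:,}")
print(f"Parameters in the model: {n_params}")
print(f"Events per parameter:    {epp:,.0f}")
```

Since only the events carry information about the hazard in a Cox model, the events-per-parameter figure is probably the more relevant of the two.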
 
#6
sorry it's not clear to me, but when you say there are 20+ variables, does that count come from counting all the dummy variables, or is it just counting the independent variables?

e.g.
Variable (No. Levels)
A (2)
B (3)

with the way you're counting variables, would you get:
no. variables: 2 (A and B, the independent variables)
or
no. variables: 3 (1 dummy for A + 2 dummies for B)
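
In case it helps, the two ways of counting can be sketched in Python like this. The helper name is hypothetical, and it assumes reference (N-1) dummy coding as described in post #1.

```python
def count_parameters(levels_per_categorical, n_continuous=0):
    """Coefficients needed when each categorical with k levels gets k-1 dummies.

    Continuous variables contribute one coefficient each.
    """
    return n_continuous + sum(k - 1 for k in levels_per_categorical)


# The toy example from this post: A has 2 levels, B has 3 levels.
levels = {"A": 2, "B": 3}

n_independent = len(levels)                    # counting variables as factors
n_dummies = count_parameters(levels.values())  # counting dummy columns

print("independent variables:", n_independent)
print("dummy-coded parameters:", n_dummies)
```

The second count is the number of coefficients the model actually estimates, so that is the one to compare against the sample size.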
 
#7
Sorry, I should have been clearer. I selected 17 independent variables, X1, ..., X17, but X1, X3, and X5 are categorical, so I transformed each of them into several dummy variables (depending on how many levels X1, X3, and X5 have). The total number of variables then became 30+, but not all of them are significant, so the final model includes about 25 variables.
 
#8
So it seems your sample size can handle lots of parameters, even your 26. I don't see why this would be more of a problem in Cox PH than in linear or logistic regression, but someone may be able to correct me. I'll mention this, though: even when the data set is large enough to handle lots of parameters, that doesn't mean the Type I error rate is properly controlled across all the coefficient significance tests. So I wouldn't be surprised if a model-building exercise turns up some screwy trends or something that doesn't make subject-domain sense (I've seen this before).
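
As a rough illustration of the Type I error point, here is a minimal Python sketch. It assumes the 26 coefficient tests are independent and that every null hypothesis is true, which real model coefficients generally are not, so treat the numbers as back-of-the-envelope only. The function names are hypothetical, and Bonferroni is just one simple correction, not necessarily the one you should use.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / n_tests


alpha = 0.05
n_tests = 26  # one significance test per coefficient (assumption)

fwer = familywise_error_rate(alpha, n_tests)
per_test = bonferroni_threshold(alpha, n_tests)

print(f"Chance of >=1 false positive at alpha={alpha}: {fwer:.2f}")
print(f"Bonferroni per-test threshold: {per_test:.5f}")
```

With a sample this large, even tiny effects will pass such thresholds, so effect sizes are worth checking alongside the p-values.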

Let us know if there are updates