Survival Analysis: Cox Proportional Hazards with 4000 indicator variables

Is using 4000 indicator variables in Cox PH a valid approach?

  • Yes, there is no difference in running 2 indicators versus 4000

    Votes: 0 0.0%
  • Yes, but with some wariness

    Votes: 0 0.0%
  • Maybe/maybe Not

    Votes: 0 0.0%
  • Not Valid, but try it anyway

    Votes: 0 0.0%

THE SITUATION: I am preparing a Cox Proportional Hazards regression model on 420,008 observations with 4001 indicator variables. The 4001st indicator is my baseline. I am using STATA for this, but STATA/IC can generate at most 2000 indicators, and the largest matrix it can generate is [n,800]. So, to do this analysis, I created a .do file that runs the Cox regression (stcox) on 6 batches of about 666 indicators each.
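To make the batching concrete, here is a minimal Python sketch of how 4000 indicators might be divided into 6 contiguous batches. The function name `split_into_batches` and the 1-based index ranges are illustrative assumptions, not part of the actual .do file.

```python
def split_into_batches(n_items, n_batches):
    """Return 1-based inclusive (start, end) ranges that cover 1..n_items.

    Hypothetical helper for illustration: any remainder is spread over
    the first batches so that no indicator is left out.
    """
    base, extra = divmod(n_items, n_batches)
    batches = []
    start = 1
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append((start, start + size - 1))
        start += size
    return batches


batches = split_into_batches(4000, 6)
for start, end in batches:
    print(f"indicators {start}-{end}: {end - start + 1} variables")
```

Note that six batches of exactly 666 would cover only 3996 indicators, so some batches need 667 to reach all 4000.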

1. Since I am comparing the individuals in the first 4000 records to the baseline (the 4001st), I am not concerned with my coefficients. However, does running the Cox model in 6 different batches affect the p-values of the individual indicator coefficients? That is, would my p-values retain their level of significance if run as a single file rather than as 6 batches? Likewise, would the coefficients retain their p-values if run in 4000 batches?

2. Is there a better way/software that can accomplish this task without batching the process?

3. This is an unconventional use of Cox Regression. Does anyone have an opinion regarding the validity of the process or interpretation of the output?


New Member
I'm not an expert, but one of the assumptions of a Cox model is that each tested variable independently contributes to a change in the dependent variable. So one question is whether your model is theoretically sound or whether it should be seen as a way to dredge your data for significant results.
Naturally, with so many tested variables, you will risk coming up with lots of false positive results.
What significance level do you plan to use?
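To illustrate why the significance level matters with this many tests, here is a rough Python sketch of two standard multiple-comparison adjustments: the Bonferroni threshold and the Benjamini-Hochberg procedure. The alpha of 0.05 and the ten example p-values are made up for illustration and are not taken from the poster's output.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q):
    """Return sorted indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Assumed alpha of 0.05 across the 4000 indicator tests.
per_test_cutoff = bonferroni_threshold(0.05, 4000)

# Made-up p-values, purely to show the procedure.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_p, q=0.05)
print(per_test_cutoff, rejected)
```

With 4000 tests, the Bonferroni cutoff is far stricter than a conventional 0.05, which is one reason the choice of significance level needs to be settled before reading the output.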
Good point about that Cox assumption. I am new to it myself, but had come to a similar conclusion about the original approach.

Now, I am conceptualizing the indicators as criteria for stratifying the population and then running each of the 4000 population stratifications as 4000 individual experiments. In this sense, each of the stratifications, when run against the baseline, is treated as an independent evaluation. I am running the 4000 as a batch. It takes forever to run, but some of the output I have seen so far shows approximately 1 in 10 with p-values less than 0.1.
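As a sanity check on that "1 in 10" figure, here is a small Python sketch of how many p-values below 0.1 one would expect if none of the 4000 indicators had any real effect. It assumes the tests are independent with uniformly distributed p-values under the null, which is a simplification here since every comparison shares the same baseline.

```python
import math
import random

n_tests = 4000
alpha = 0.1

# Under the null, each test falls below alpha with probability alpha.
expected = n_tests * alpha
sd = math.sqrt(n_tests * alpha * (1 - alpha))

# Simulate one set of 4000 null p-values (seed chosen arbitrarily).
random.seed(0)
simulated = sum(random.random() < alpha for _ in range(n_tests))

print(f"expected below {alpha}: {expected:.0f} (sd about {sd:.1f})")
print(f"simulated below {alpha}: {simulated}")
```

Under these assumptions, about 400 of the 4000 tests (1 in 10) would fall below 0.1 by chance alone, so the rate reported above is roughly what pure noise would produce.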