Model selection versus variable selection


New Member
For my research I have 3 different sets of data. The aim is to evaluate the prognostic value for survival that sets 2 and 3 add to a baseline model (set 1).
Set 1 (baseline model): 3 variables
Set 2: 10 variables (evaluated as a group and individually)
Set 3: 3 variables (evaluated as a group and individually)
Cox models were ranked by their global fit using AIC. From this ranking I can tell that set 2 and set 3 each add prognostic information to the baseline model (set 1). Set 3 also adds prognostic information to the baseline model extended with set 2, both when its variables are added individually and, more so, when they are added as a group.
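To make the ranking step concrete, here is a minimal sketch of how AIC is computed from a fitted model's maximised log partial likelihood and its number of coefficients. The log-likelihood values below are hypothetical and chosen only for illustration; the coefficient counts follow the set sizes above (3, 10 and 3 variables, assuming one coefficient per variable).

```python
# Hypothetical maximised log partial likelihoods and coefficient counts.
# None of these log-likelihoods come from the actual analysis.
models = {
    "set1": (-512.4, 3),
    "set1+set2": (-498.9, 13),
    "set1+set3": (-505.1, 6),
    "set1+set2+set3": (-490.2, 16),
}


def aic(loglik, n_params):
    """AIC = 2k - 2*logL; lower means better fit after the complexity penalty."""
    return 2 * n_params - 2 * loglik


def rank_by_aic(candidates):
    """Return (name, AIC) pairs sorted from lowest (best) to highest AIC."""
    scored = [(name, aic(ll, k)) for name, (ll, k) in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


ranking = rank_by_aic(models)
for name, value in ranking:
    print(f"{name}: AIC = {value:.1f}")
```

With these made-up numbers the full model ranks first and the baseline last, which mirrors the pattern described above; real values would come from the fitted Cox models.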

I also conducted a classical multivariable analysis, for example to see the HR (95% CI) of the variables in set 3 while controlling for the variables in set 2 and set 1. Of the set 3 variables, only 1 remained significant. I also performed a stepwise analysis (backward elimination in R, so based on AIC) (I know that statisticians don’t like stepwise) to select the best subset of predictors, and this showed that none of the variables of set 3 were retained.
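One way to see how the AIC rule in a backward step relates to a significance test on a single coefficient: removing one coefficient lowers AIC only when the likelihood-ratio statistic is below 2, whereas a likelihood-ratio test at the 5% level needs a statistic above 3.841 (chi-square, 1 degree of freedom). The sketch below uses hypothetical log-likelihoods and illustrative function names; note that the p-values reported next to an HR are usually Wald tests, which are close to but not identical to the likelihood-ratio test.

```python
# 5% critical value of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_5PCT = 3.841


def lr_statistic(loglik_full, loglik_reduced):
    """Likelihood-ratio statistic for two nested models."""
    return 2 * (loglik_full - loglik_reduced)


def aic_keeps(lr, df=1):
    """AIC prefers the larger model when LR exceeds 2 per extra coefficient."""
    return lr > 2 * df


def lr_test_keeps(lr):
    """A 5% likelihood-ratio test keeps one extra coefficient when LR > 3.841."""
    return lr > CHI2_1DF_5PCT


# Hypothetical log partial likelihoods with and without one variable.
lr = lr_statistic(-490.2, -491.6)
print(f"LR = {lr:.2f}, AIC keeps: {aic_keeps(lr)}, 5% test keeps: {lr_test_keeps(lr)}")
```

So the two criteria use different thresholds on the same quantity, and they are evaluated against different comparison models at each step, which is one reason their conclusions need not coincide.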
The conclusions of the two approaches (model selection vs variable selection) seem to contradict each other: according to the model comparison, the variables in set 3 are interesting because they add prognostic information, yet according to the variable selection they are not important. Which exact question does each approach answer? There must be a small but important difference, just look at my results. Or is there any other clarification?

Any help would be much appreciated.


New Member
Perhaps I didn't use the correct terminology. For all patients, data were collected on 16 variables. So, it's one dataset.
I grouped the 16 variables into 3 groups (sets is perhaps not the correct term) for interpretation reasons. Group (set) 3, for example, concerns lab values, while the other 2 groups contain other types of information.


New Member
OK, so here is what I make of it.

1) Model selection/comparison (Akaike Information Criterion) is good for ranking Cox models.
It is also useful for ranking the relative importance of single variables.
For example, comparing model ~ABCD + E + F + G with model ~ABCD, model ~ABCD + E, model ~ABCD + F, and so on allows you to rank variables E, F, etc. and to measure how much prognostic information E and F add to the model with ABCD.
However, it is not intended for variable selection or for selecting a best subset. The focus is on how well the model fits the data and thus on prediction.

2) Multivariable analysis
The focus is on the variables. For example, the HR of E adjusted for ABCD is 1.2 (95% CI).
A variable could be a strong predictor while the model as a whole is bad. I don't think the reverse is possible (a weak or non-predictive variable and a good model).
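As a small illustration of what the adjusted HR and its interval are, here is a sketch that converts a Cox coefficient and its standard error into an HR with a Wald 95% CI. The coefficient is simply log(1.2) to match the example above, and the standard error of 0.08 is a made-up value, not a result from the analysis.

```python
import math


def hazard_ratio_ci(beta, se, z=1.96):
    """Return (HR, lower, upper) from a Cox coefficient and its standard error.

    HR = exp(beta); the Wald 95% limits are exp(beta -/+ z * se).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient for E adjusted for ABCD: log(1.2) with SE 0.08.
hr, lower, upper = hazard_ratio_ci(math.log(1.2), 0.08)
print(f"HR {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 says that E is associated with the hazard given ABCD; it says nothing by itself about how well the whole model fits.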

Anyway that's how I see it.