best modeling approach?

#1
I have a data set with about 5,000 patients and am trying to create a risk prediction model for a binary outcome with no competing risks (alive/dead). I would like to use an automatic variable selection method, and that is what I need advice on (e.g., LASSO, ridge, decision tree, random forest? ... and if someone wants to suggest stepwise, I'd be interested).

- The goal is not to understand biology (e.g., test hypotheses and adjust for confounders) but rather to create a risk prediction model.
- It is longitudinal data
- Events are somewhat sparse (~ 5-10% population will have event).
- There are probably 40-50 candidate variables, which will likely satisfy the proportional hazards assumption.

Any advice on what approach to use / how to go about this? Also, how would you validate the suggested approach?

Really appreciate any practical guidance here.
 

hlsmith

Omega Contributor
#2
What do you plan to do with the results??

If you just want a program to do everything for you, randomForestSRC in R seems like an option. You might want to hold out a random sample to validate it on. I'm not familiar with a stepwise LASSO, but that may be a better approach for you.
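Just as a rough sketch of the hold-out idea (assuming your data live in a data frame called dat with a survival time column time, an event indicator status, and the candidate predictors -- adjust the names to your own data):

    library(randomForestSRC)
    library(survival)

    set.seed(42)
    train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))  # 70/30 split
    train <- dat[train_idx, ]
    test  <- dat[-train_idx, ]

    # Random survival forest on the training sample, with variable importance
    fit <- rfsrc(Surv(time, status) ~ ., data = train, ntree = 1000, importance = TRUE)
    fit$importance                   # which predictors the forest leans on

    # Evaluate on the held-out sample
    pred <- predict(fit, newdata = test)
    pred$err.rate                    # test-set error (1 - Harrell's C-index)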
 
#3
Basically the goal is to predict risk of death after an intervention.

Thank you for the pointer to the package; I'll check it out. It basically sounds like you are recommending random forest modeling.

Thanks again. Any other ideas appreciated.
 

hlsmith

Omega Contributor
#4
Well, I was also trying to hint at my dislike for automated approaches. They do not take into account actual knowledge of the data context (e.g., dichotomizing age, originally a continuous predictor, and losing information), and they run the risk of overfitting the data. If you only have about 40 real predictors, it may be worth spending a day or two doing this with a traditional survival modeling approach as well.
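For what it's worth, the traditional route could be as simple as something like this (a sketch only -- the predictor names age, stage, and comorbidity and the data frame dat are placeholders for whatever your clinical knowledge suggests):

    library(survival)

    # Hand-picked Cox model based on subject-matter knowledge
    cox_fit <- coxph(Surv(time, status) ~ age + stage + comorbidity, data = dat)
    summary(cox_fit)   # hazard ratios and Harrell's concordance

    # Check the proportional hazards assumption
    cox.zph(cox_fit)   # Schoenfeld residual tests per covariate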


My hesitation comes back to how results may actually be utilized in the future.
 
#5
I agree automatic methods can certainly be problematic; however, this seems like a reasonable problem for them. A consideration here is that no one will use a risk algorithm that requires 40 variables; it just isn't practical. So a parsimonious model really has some value from a practical standpoint.

So after quite a bit of research, I think Bayesian model averaging (https://cran.r-project.org/web/packages/BMA/BMA.pdf) may be the best fit here. From what I read online, it also seems to select the same variables as LASSO.
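Here is roughly what I'm planning to try, as a sketch only -- the data frame dat and the column names time and status are placeholders, and I haven't checked the bic.surv defaults:

    library(BMA)
    library(survival)
    library(glmnet)

    X      <- dat[, setdiff(names(dat), c("time", "status"))]  # candidate predictors
    surv.t <- dat$time
    cens   <- dat$status

    # Bayesian model averaging over Cox models (BIC approximation)
    bma_fit <- bic.surv(X, surv.t, cens)
    summary(bma_fit)   # posterior probability that each variable belongs in the model

    # LASSO-penalized Cox model, to compare the selected variables
    x_mat  <- model.matrix(~ . - 1, data = X)
    cv_fit <- cv.glmnet(x_mat, cbind(time = surv.t, status = cens), family = "cox")
    coef(cv_fit, s = "lambda.1se")   # nonzero coefficients = variables LASSO keeps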

Thanks again ... I may try random forest modeling, but I don't have a ton of experience there. It certainly is cool, though.
 

hlsmith

Omega Contributor
#6
So you found a BMA approach for survival models? What program did you use, and what does it compare models on, accuracy?

Thanks for the update!