Hi,
I'm working with a high-dimensional medical database containing detailed monthly medication reimbursement data, as well as the occurrence of various medical outcomes, over several years (2010 to 2013). The database covers about 600,000 subjects.
My goal is to identify associations between being treated with some medication M and a given outcome O at the end of follow-up (December 2013), using penalized regression methods. I want to analyze a large number of medications simultaneously.
I would like to be able to identify:
- Immediate effects of exposure to medication M on outcome O: "Is being treated with medication M 30 days before the end of follow-up associated with the occurrence of outcome O in December 2013?"
- Cumulative effects of exposure to medication M on outcome O: "Is being treated with medication M 90, 60 and 30 days before the end of follow-up associated with the occurrence of outcome O in December 2013?" Or should the window be 120, 90, 60 and 30 days, and so on?
I'm planning to mimic cumulative exposure with a series of dichotomous variables: was the subject exposed 120 days before (yes/no), 90 days before (yes/no), and so on.
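To make the encoding concrete, here is a minimal Python sketch of what I have in mind. It assumes each subject's reimbursements for one medication are stored as a set of (year, month) pairs, and it approximates 30, 60, 90 and 120 days as 1, 2, 3 and 4 calendar months before December 2013, since my data are monthly. The names `months_before`, `lagged_exposure` and `exposed_30d` are hypothetical and only serve the illustration.

```python
# End of follow-up, taken from the question: December 2013.
END_YEAR, END_MONTH = 2013, 12

# Assumption: lags of 30/60/90/120 days are approximated by 1/2/3/4 months.
LAG_MONTHS = (1, 2, 3, 4)


def months_before(year, month, k):
    """Return the (year, month) pair that lies k calendar months earlier."""
    index = year * 12 + (month - 1) - k
    return index // 12, index % 12 + 1


def lagged_exposure(reimbursed_months, lags=LAG_MONTHS):
    """Build one 0/1 indicator per lag for a single subject and medication.

    reimbursed_months: set of (year, month) pairs with a reimbursement.
    """
    return {
        f"exposed_{30 * k}d": int(
            months_before(END_YEAR, END_MONTH, k) in reimbursed_months
        )
        for k in lags
    }


# Hypothetical subject: reimbursed in Nov, Oct and Aug 2013, not in Sep 2013.
subject_months = {(2013, 11), (2013, 10), (2013, 8)}
flags = lagged_exposure(subject_months)
print(flags)
```

With many medications, each one would contribute its own set of lag indicators, which is where the correlation between columns comes from.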
My main concern is choosing the right methodology, notably to account for the correlation between my variables (especially those describing cumulative exposure). I'm considering the elastic net, but I hesitate because I'm not in the p > N case. I'm also considering the group lasso, but it seems difficult to define the right groups.
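For reference, the elastic-net fit I am thinking of is the usual penalized logistic regression, written below in the glmnet-style parameterization. The notation is my own: y_i is the 0/1 outcome O for subject i, x_i is the vector of exposure indicators, lambda is the overall penalty strength, and alpha mixes the lasso (alpha = 1) and ridge (alpha = 0) penalties.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
  \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
        - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  + \lambda \Bigl[\, \alpha\,\lVert\beta\rVert_1
        + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \Bigr]
```

The ridge term is the part I would expect to help with the correlated lag indicators, since it tends to shrink the coefficients of correlated variables toward each other instead of arbitrarily keeping one of them.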
Thanks for any suggestions or advice!
Yohann