Thread: Variable selection with longitudinal and correlated data

1. Variable selection with longitudinal and correlated data

Hi,

I'm working with a high-dimensional medical database containing detailed monthly medication reimbursement data as well as the occurrence of various medical outcomes over several years (2010 to 2013). The database covers about 600,000 subjects.

My goal is to identify associations between treatment with a medication M and a given outcome O at the end of follow-up (December 2013), using penalized regression methods. The aim is to analyze a large number of medications simultaneously.

I would like to be able to identify:

- Immediate effects of exposure to medication M on outcome O: "Is being treated with medication M 30 days before the end of follow-up associated with the occurrence of outcome O in December 2013?"

- Cumulative effects of exposure to medication M on outcome O: "Is being treated with medication M 90 days, 60 days and 30 days before the end of follow-up associated with the occurrence of outcome O in December 2013?" Or perhaps 120 days, 90 days, 60 days and 30 days, and so on.
I'm planning to represent cumulative exposure with a series of dichotomous variables: was the subject exposed 120 days before (yes/no), 90 days before (yes/no), and so on.
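As a rough illustration of the dichotomous coding described above, here is a minimal Python sketch. The data layout (a set of reimbursed year/month pairs per subject), the subject IDs, and the names `months_before`, `lag_indicators` and `exposed_30d` etc. are all hypothetical, and each 30-day step is approximated by one calendar month because the data are monthly.

```python
# Hypothetical layout: for each subject, the set of (year, month) pairs in
# which medication M was reimbursed. Real reimbursement data will differ.
reimbursements = {
    "s1": {(2013, 11), (2013, 10), (2013, 9)},
    "s2": {(2013, 11)},
    "s3": {(2013, 8)},
}

END_OF_FOLLOW_UP = (2013, 12)  # December 2013, as in the post
LAGS = (1, 2, 3, 4)            # months back, roughly 30/60/90/120 days


def months_before(year_month, k):
    """Return the (year, month) that lies k months before year_month."""
    year, month = year_month
    index = year * 12 + (month - 1) - k
    return index // 12, index % 12 + 1


def lag_indicators(exposed_months, end=END_OF_FOLLOW_UP, lags=LAGS):
    """One 0/1 indicator per lag: was the subject exposed k months before end?"""
    return {
        f"exposed_{30 * k}d": int(months_before(end, k) in exposed_months)
        for k in lags
    }


# One row of indicators per subject, ready to be used as regression covariates.
design = {sid: lag_indicators(months) for sid, months in reimbursements.items()}
for sid, row in design.items():
    print(sid, row)
```

Neighbouring indicators built this way will usually be strongly correlated, because subjects on a chronic treatment are exposed in consecutive months.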

My main concern is choosing the right methodology, notably to account for the correlation between my variables (especially those describing cumulative exposure). I'm considering the elastic net, but I hesitate because I'm not in the p > N case. I'm also considering the group lasso, but it seems difficult to define the right groups.
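For reference, one common way of writing the elastic-net criterion for a logistic model is sketched below. The notation is an assumption, not something fixed by this thread: $y_i \in \{0, 1\}$ is outcome O for subject $i$, $x_i$ is the vector of lagged exposure indicators, $\lambda \ge 0$ is the overall penalty strength and $\alpha \in [0, 1]$ mixes the two penalties.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
    \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
           - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda\Bigl[\, \alpha\,\lVert\beta\rVert_1
           + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \Bigr]
```

With $\alpha = 1$ this reduces to the lasso and with $\alpha = 0$ to ridge regression. The ridge term is what tends to keep correlated predictors in or out of the model together, and nothing in the criterion requires p > N.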

Yohann

2. Re: Variable selection with longitudinal and correlated data

So you have a rare outcome?

"The goal will be to analyze a large number of medications simultaneously" You mean medication durations?

One issue is that not every person will have the same length of follow-up during the time period, unless you use a fixed follow-up of, say, 120 days rather than an end date.

Your question seems to scream proportional hazards model. Why not run a survival model?

I think you are probably fine with the categorical variables in the model. However, the standard errors may get inflated due to collinearity among the variables. You would have issues if a combination of the variables summed to equal another variable, or if the number of outcomes did not differ between the 60- and 90-day increments. The first issue should not arise here, but if the outcome is rare, the second could.
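A quick way to see how much the lagged indicators overlap is to look at their pairwise correlations before fitting anything. The sketch below is only illustrative: the toy data, the column names and the 0.7 cut-off are all made up, and `pearson` is a hand-written helper rather than a library function.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: one entry per subject, columns are lagged exposure
# indicators (names and values are illustrative, not from a real database).
columns = {
    "exposed_30d": [1, 1, 0, 0, 1, 0, 1, 0],
    "exposed_60d": [1, 1, 0, 0, 1, 0, 0, 0],
    "exposed_90d": [1, 0, 0, 1, 1, 0, 0, 0],
}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Correlation for every pair of indicator columns.
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
for (a, b), r in correlations.items():
    flag = "  <-- strongly correlated" if abs(r) > 0.7 else ""  # arbitrary cut-off
    print(f"{a} vs {b}: r = {r:.2f}{flag}")
```

Pairs with correlations close to 1 are the ones most likely to produce inflated standard errors in an unpenalized model.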

3. Re: Variable selection with longitudinal and correlated data

Hi hlsmith,

I'm not yet sure about the frequencies of the outcomes I will study, but they are not likely to be very rare.

The main goal of my study is to perform an exploratory analysis to identify potential side effects of a large number of medications in a "real life" setting, using my national reimbursement database. I was initially planning to analyze several treatments simultaneously (hence the use of penalized regression), but it's probably better to perform one analysis per treatment to avoid an overly complex model.

Why do you suggest using a survival model? Since I will work with the outcome at a fixed date, I think a penalized linear/logistic model would be more appropriate.

Yes, you are right about the issue of collinearity; I think methods such as the LASSO suffer from this problem.

Yohann
