Lasso Regression, Repeated Measures

Hi everyone. I am new to this forum. I’m hoping someone can help me with this approach. In a dataset of ~200 participants, I have about 50 predictor variables and 1 outcome variable measured at two timepoints, all of which are continuous.

I would like to determine which predictors most strongly influence my outcome variable accounting for repeated measures of the two timepoints.

A factor analysis was suggested to me (as the first and final step for my purposes), yet this will not be related to my dependent variable in any way, to my understanding. Is it ever the case where I could do a FA, then use the independent predictor variables that made it into the first factor loading in a regression? I’m sure this introduces bias given the dependent variable wasn’t considered in the FA?

Otherwise, perhaps a penalized lasso or elastic net regression or random forest would be an improved approach here? But also to my understanding, these may not be able to account for repeated measures. In this case, is it possible or even recommended to conduct a lasso/elastic net/RF on each timepoint separately, then choose the common predictors between the timepoints to include in my second step (regression)?

Is there another statistical approach you would take here? Many thanks!


Less is more. Stay pure. Stay poor.
Were the covariates and the dependent variable collected repeatedly at the same times, or not? So do you have two values for the DV, and when were they collected?
I have one value for each of the 50 independent variables at each time point. Measures were collected at month 12 and month 24. I also have one dependent variable collected at month 12 and month 24. I'm interested in finding the strongest predictors of my dependent variable among these independent variables.


I haven't seen repeated measures LASSO, but there are many newer versions of it, so I wouldn't be surprised if someone has adapted it. If data are collected at cross-sections, how do you know they are independent of the DV? Also, since there are two different time points, could the relationship between the IVs and DV change? If so, running two LASSO models is not out of the question.

So you should standardize all the variables and run two models, perhaps. However, given your sample size you may not have sufficient data to create a holdout set for model validation. Also, estimates from LASSO are considered selection biased. So the variable rankings can still be informative, but SE estimates may be off, since the same model is used both to choose terms and to provide their estimates. There is something called selective inference that attempts to recover estimates with correct precision and coverage.
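To make the "standardize and run two models" idea concrete, here is a rough sketch using scikit-learn's `LassoCV`. Everything in it is an assumption for illustration: the data are simulated (200 rows, 50 predictors, with predictors 0, 1 and 2 carrying the signal), and `simulate_timepoint` and `lasso_selected` are made-up helper names. With real data you would pass in the month 12 and month 24 measurements for the same participants.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 50  # roughly the sample size and predictor count from the question


def simulate_timepoint(rng, n, p):
    """Hypothetical data: only predictors 0, 1 and 2 affect the outcome."""
    X = rng.normal(size=(n, p))
    y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(size=n)
    return X, y


def lasso_selected(X, y, seed=0):
    """Standardize X and y, fit a cross-validated LASSO, return selected columns."""
    Xs = StandardScaler().fit_transform(X)
    ys = (y - y.mean()) / y.std()
    model = LassoCV(cv=5, random_state=seed).fit(Xs, ys)
    selected = {int(j) for j in np.flatnonzero(model.coef_)}
    return selected, model.coef_


# One model per timepoint (stand-ins for month 12 and month 24).
X12, y12 = simulate_timepoint(rng, n, p)
X24, y24 = simulate_timepoint(rng, n, p)
sel12, coef12 = lasso_selected(X12, y12)
sel24, coef24 = lasso_selected(X24, y24)

# Predictors kept at both timepoints, plus their standardized coefficients.
common = sorted(sel12 & sel24)
for j in common:
    print(f"x{j}: month 12 = {coef12[j]: .3f}, month 24 = {coef24[j]: .3f}")
```

The coefficients printed here are the shrunken LASSO estimates, so they are only useful for comparing rankings across the two timepoints, not for inference.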

As you mentioned if you are just looking for variable importance, RF is another option. If you had enough data you could run both approaches and validate with holdout data for final selection.
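For the random forest route, a minimal sketch of ranking predictors by permutation importance on a holdout split might look like the following. The data are simulated and the effect sizes are invented (predictors 0 and 1 carry the signal); the 75/25 split and the number of trees are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical data: 200 participants, 50 predictors, two of which matter.
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)

# Hold out a quarter of the rows so importance is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)

# Permutation importance: drop in holdout R^2 when one column is shuffled.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=20, random_state=0
)

ranking = np.argsort(result.importances_mean)[::-1]
top5 = [int(j) for j in ranking[:5]]
for j in top5:
    print(f"x{j}: mean importance = {result.importances_mean[j]:.3f}")
```

With only about 50 holdout rows the importances are noisy, which is the sample size concern raised above.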
Definitely. I think running two lasso models wouldn’t be out of the question since the relationships may be different, but the idea behind the repeated measures was to add power to our dataset.

Will have to give the holdout set more thought. I think one option is to run the two lasso models and see whether there is heterogeneity in the estimates. If not, I could then run repeated measures generalized linear mixed models with the variables that have the strongest estimates. Would the selective inference that you mentioned replace this last step?
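The second step described here could be sketched as a linear mixed model with a random intercept per participant, for example with statsmodels' `mixedlm`. This is only an illustration under assumptions: the data are simulated, `x1` and `x2` stand in for whichever predictors survive the LASSO step, and a Gaussian outcome is assumed (the simplest case of a generalized linear mixed model).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200

# Hypothetical participant-level effect shared across both visits.
subject_effect = rng.normal(scale=1.0, size=n)

# Build long-format data: one row per participant per visit.
rows = []
for month in (12, 24):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 * x1 - 0.5 * x2 + subject_effect + rng.normal(scale=0.5, size=n)
    rows.append(
        pd.DataFrame({"id": np.arange(n), "month": month, "x1": x1, "x2": x2, "y": y})
    )
long_df = pd.concat(rows, ignore_index=True)

# Random intercept for each participant accounts for the repeated measures.
model = smf.mixedlm("y ~ x1 + x2 + C(month)", data=long_df, groups=long_df["id"])
fit = model.fit()

print(fit.fe_params)  # fixed-effect estimates for x1, x2 and the month indicator
```

Adding terms such as `x1:C(month)` to the formula would be one way to check whether a predictor's effect differs between month 12 and month 24. Standard errors from this model would still be affected by the earlier selection step if the same data were used for both.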

Thank you very much for your help!


Yeah, I guess I recall seeing a session at the last Joint Statistical Meetings on hierarchical LASSO, which wouldn't be too far removed from repeated measures. Let us know how the process goes. One comment: LASSO models don't know if you are putting mediators, colliders, or extraneous terms in the model. It's a golem that will run on anything, so it is up to you to make sure the wrong initial features don't make it into the model.