Transporting LASSO Model Results to Logistic Reg for Estimates

I have a dataset which will be entirely made up of binary variables (i.e., ~25 IVs, 1 DV). Variables will likely be correlated. N=~300.

-I was planning to split the dataset into a training (60%) and test (40%) set. Then I will run glmnet's LASSO (in R) on the training set using k-fold CV and the one-standard-error rule for model selection.

-Next, I was going to use the above model's final predictor set in an exact logistic regression model to get estimates.

-Finally, I was going to score the test dataset using the logistic model coefficients.
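For concreteness, here is a minimal sketch of those three steps in R, using simulated stand-ins for the real data (~25 binary IVs, binary DV, N = 300; all object names are mine). It requires the glmnet package, and the step-2 refit is an ordinary glm rather than an exact logistic regression; for the exact version you would need something like the elrm package.

```r
library(glmnet)

set.seed(1)
n   <- 300
x   <- matrix(rbinom(n * 25, 1, 0.5), nrow = n,
              dimnames = list(NULL, paste0("V", 1:25)))
y   <- rbinom(n, 1, plogis(-2 + x[, 1] + x[, 2]))  # rare-ish binary DV
dat <- data.frame(y, x)

# Step 1: 60/40 split, then k-fold CV LASSO with the one-SE rule
train  <- sample(n, 0.6 * n)
cv_fit <- cv.glmnet(x[train, ], y[train], family = "binomial",
                    alpha = 1, nfolds = 10)
sel      <- coef(cv_fit, s = "lambda.1se")          # the one-SE rule
selected <- setdiff(rownames(sel)[as.vector(sel != 0)], "(Intercept)")
if (length(selected) == 0) selected <- "1"          # intercept-only fallback

# Step 2: refit the selected predictors with an (unpenalized) logistic model
fit <- glm(reformulate(selected, "y"), family = binomial,
           data = dat[train, ])

# Step 3: score the 40% holdout with the logistic coefficients
test_probs <- predict(fit, newdata = dat[-train, ], type = "response")
```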

Does anyone see any harm in doing this? My issue is that the LASSO won't give me interpretable estimates for clinical use, hence the switch in modeling approaches. I don't know if this would be frowned upon, since my final estimates won't be penalized, but they will be selected through a penalized process. Comments and feedback appreciated.

P.S., I am just planning this in my head for now, so I won't actually have time to run it for a couple of weeks. Also, I have read a little about the adaptive LASSO (adalasso), which I could incorporate into the first model-building (training) part if it is currently available for use with a binary outcome.

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

So in step 1 you're going to search for a model using the training set. In step 2, you're going to use that model's covariates to fit the final model. Which data set will that be fit on? I'm thinking there should be a third data set for this purpose: one for training the model selection, one for fitting the final model, and one for testing the predictive accuracy (the final hold-out sample).

I'm not sure how to handle the penalization issue you're concerned with. In step 2 I could imagine doing any further regularization with this additional data set (k-folds) to deal with it there, leaving a final hold-out sample for measuring predictive accuracy. Maybe split your current test set in two? (60/20/20)
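One quick way to carve that 60/20/20 split in R, assuming `dat` is the full data.frame (the object and group names here are mine):

```r
set.seed(3)
dat <- data.frame(id = 1:300)  # stand-in for the real N = 300 data

# Shuffle 60/20/20 group labels across the rows
grp <- sample(rep(c("select", "fit", "test"),
                  times = c(0.6, 0.2, 0.2) * nrow(dat)))

# parts$select for feature selection, parts$fit for the final model,
# parts$test as the hold-out
parts <- split(dat, grp)
table(grp)
```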

You should definitely use jQuery. It's really great and does all things.

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Thanks for the feedback. The training set will be used for both the LASSO and the logistic model. I am not sure I need the third dataset, since I don't plan to perform any more model building/development after the LASSO; I am just using the final best subset of predictors from the LASSO in a logistic model and then scoring the second set. If I had a third set, I would be applying the model to a new set and then scoring a third set, which may better depict the general accuracy, but I don't think it is necessary.

I know some people split datasets into training, testing, and validation sets, but what do they actually do in the testing set? I guess I could test the variables in it, but I don't plan to do any additional tweaking, so I feel that logic alone should hold for that component of my plan. A secondary issue, which I wouldn't call a researcher degree of freedom so much as a limitation of the project size, is that the outcome is rare. I can't recall offhand, but it may be 5-10%. So I think a second split could result in worse prediction due to over-parameterization given the rarity of the outcome.

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

The problem I see is that you're doing 2 different things with the same data set, where the 2nd step is dependent on the first step.

You're using LASSO for feature selection based on some data set that describes a phenomenon you want to build a predictor for. Great, but now that you've used that data set for feature selection, you wouldn't want to then use it for building your predictive model, which is what you're doing.

My point is: use your 60% (or less) to k-fold CV a selected model that tells you which covariates, among the possible covariates, to include in your model. This simply informs which features to use in your predictive model. It is not, say, selecting which model to use among a number of fitted models, which is a different situation altogether.

Once you know which features you're going to use, you fit your predictive model. But you don't want to tune it on the data which informed which features were to be used in the model. This is where you can use another training data set (training the predictive model, not feature selection). Once you've fit that model, then you can validate it against a hold-out sample for testing error. Thus, you've cleanly separated the tasks of feature selection, model fitting, and validation, each using an independent data set. Does that make sense?


Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Yes, that makes sense, but I just want to skip the tuning of the prediction model and apply the features. Tuning the prediction model just feels like overkill.

What about doing the feature selection with LASSO and then applying the features to the 40% holdout set? So skip the scoring using the features applied to the initial 60%.

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Hi,
Couldn't you use a fundamentally different method for feature selection, like a classification tree or even a forest, and then use those features in building a logistic model?
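For what it's worth, a rough sketch of that idea: rank features with a random forest, then carry the top-ranked ones into a logistic model. This requires the randomForest package, the data here are made up, and the "top 5" cutoff is arbitrary.

```r
library(randomForest)

set.seed(4)
dat <- data.frame(y = factor(rbinom(300, 1, 0.3)),
                  matrix(rbinom(300 * 10, 1, 0.5), nrow = 300))

# Fit the forest with permutation importance enabled
rf  <- randomForest(y ~ ., data = dat, importance = TRUE)
imp <- importance(rf, type = 1)               # mean decrease in accuracy
top <- rownames(imp)[order(imp, decreasing = TRUE)][1:5]  # arbitrary cutoff

# Carry the top-ranked features into a logistic model
fit <- glm(reformulate(top, "y"), family = binomial, data = dat)
```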

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

rogojel,

Thanks for the reply. That is primarily what I am doing but using LASSO instead. I had thought of using a tree based approach, but due to the correlation in data and trees being focused on splits (interactions), a penalization model seemed more apt for my data and purpose. I just didn't know if my above approach seemed to pass the general intuition test and make sense to others without setting off red flags.

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Originally Posted by hlsmith

Yes, that makes sense, but I just want to skip the tuning of the prediction model and apply the features. Tuning the prediction model just feels like overkill.

What about doing the feature selection with LASSO and then applying the features to the 40% holdout set? So skip the scoring using the features applied to the initial 60%.

I wouldn't say it's overkill. It eliminates any potential overfitting to that data, since your choice of features was based on a given X and now you're fixing your estimates to the same X. Maybe it has no gain in predictive accuracy as estimated by the test error, but I would avoid it. Let your estimates come from fitting to data the model hasn't been used on yet. Then test it against new data.


Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Yes, I heard Trevor Hastie say that getting SEs is a work in progress and that you can bootstrap.

Though, would that mean running an initial LASSO on the 60% to get features, then running a new LASSO model on the 40%, forcing it to use only the predefined features, and then bootstrapping this second model? That seems feasible enough. Though I haven't done bootstrapping of models in R before (writing the code myself, not cutting and pasting), so I would require assistance in that regard.

Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Bootstrapping is pretty easy. You can roll your own loop or use the boot package's boot function. The main idea is that you take a random sample with replacement from your data set, of the same size, and run the model. Save the estimates. Repeat a bunch of times. Now you can look at the distribution of those estimates to see where the 95% cutoffs are in the tails. There is your interval.

If the data is not too large (i.e., you can replicate the data set K times), then a simple approach is to pre-fetch all your samples. Or, as I do here, just sample the row-number indexes (note: you could do the sampling in the fitting process itself instead of in 2 steps as I do here; I like to separate tasks, personally).

Here is an easy approach. I have 2 functions. The index function takes a size N and randomly samples, with replacement, a sequence of row numbers. We lapply to iterate 1000 bootstrap samples (indexes). Then we lapply these sample indexes, passing the data set to our model-fitting function, which returns in this case the coefficients of the model. This is done K=1000 times very naturally using lapply. We can then access the vector of fits for each coefficient using sapply and the accessor "[" function. Easy.
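The code itself didn't come through in the post, so here is my reconstruction of that two-function approach, using lm(mpg ~ wt) on mtcars as in the run described below (details are a best guess):

```r
set.seed(42)

# Function 1: sample a size-N sequence of row numbers, with replacement
bootIndex <- function(n) sample(n, n, replace = TRUE)

# Function 2: fit the model to one bootstrap sample; return coefficients
bootFit <- function(idx, data) coef(lm(mpg ~ wt, data = data[idx, ]))

# Pre-fetch the K = 1000 bootstrap samples (indexes), then fit each one
indexes <- lapply(seq_len(1000), function(i) bootIndex(nrow(mtcars)))
fits    <- lapply(indexes, bootFit, data = mtcars)

# Access each coefficient's bootstrap distribution via sapply and "["
b0 <- sapply(fits, "[", "(Intercept)")
b1 <- sapply(fits, "[", "wt")

c(mean(b0), mean(b1))            # bootstrap means
quantile(b1, c(0.025, 0.975))    # percentile-style 95% interval
```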

In my run of mtcars I get mpg = 37.285 - 5.344*wt

The mean of my bootstrap was 37.44509 and -5.41729, respectively with CIs (33.12, 42.51) and (-7.03, -4.18), respectively. Compare with confint results on that mtcars model (33.45, 41.12) and (-6.49, -4.20).

I'd read up on examples of how to use the boot function in R. It also computes a number of types of confidence intervals from your bootstrap (I believe what I did above is equivalent to a percentile bootstrap CI). See Table 11.9 onward here; I roll my own and then show the boot example: http://rpubs.com/bryangoodrich/5225
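For comparison, the same bootstrap through the boot package: boot() calls your statistic with the data plus an index vector, and boot.ci() can report the percentile interval directly (the model mirrors the mtcars example above).

```r
library(boot)  # recommended package, ships with R

set.seed(42)

# Statistic: refit the model on the resampled rows, return coefficients
coef_fun <- function(data, idx) coef(lm(mpg ~ wt, data = data[idx, ]))
b <- boot(mtcars, coef_fun, R = 1000)

boot.ci(b, type = "perc", index = 2)  # percentile CI for the wt slope
```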


Re: Transporting LASSO Model Results to Logistic Reg for Estimates

Thanks, I will check this out later tonight; it should help. I am very familiar with the bootstrap in SAS, but I have only used it a couple of times in R, where a package called on the boot package and did all the work for me.