# Transporting LASSO Model Results to Logistic Reg for Estimates

#### hlsmith

##### Omega Contributor
I have a dataset which will be entirely made up of binary variables (i.e., ~25 IVs, 1 DV). Variables will likely be correlated. N=~300.

-I was planning to split the dataset into a training (60%) and test (40%) set. Then I will run glmnet's LASSO (in R) on the training set using k-fold CV and the one-standard-error rule (glmnet's lambda.1se) for model selection.

-Next, I was going to use the above model's final predictor set in an exact logistic regression model to get estimates.

-Finally, I was going to score the test dataset using the logistic model coefficients.

Does anyone see any harm in doing this? My issue is that LASSO won't give me interpretable estimates for clinical use, which is why I switch modeling approaches. I don't know whether this would be frowned upon, since my final estimates won't be penalized, but they will have been selected through a penalized process. Comments and feedback appreciated.

P.S., I am just planning this in my head for now, so I won't actually have time to run it for a couple of weeks. Also, I have read a little about the adaptive LASSO (adalasso), which I could incorporate into the first model-building part (training) if it is currently available for use with a binary outcome.
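For concreteness, the first step as described might look like the sketch below, on simulated stand-in data (the variable names, effect sizes, and 10-fold choice are illustrative, not from the actual study):

```r
library(glmnet)

set.seed(42)
# simulated stand-in for the study data: ~25 binary IVs, one binary DV, n = 300
n <- 300; p <- 25
X <- matrix(rbinom(n * p, 1, 0.5), nrow = n,
            dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(-2 + X[, 1] + X[, 2]))

# k-fold CV for the logistic LASSO (alpha = 1 is the default, shown for clarity)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 10)

# one-standard-error rule: largest lambda within one SE of the CV minimum
b <- coef(cvfit, s = "lambda.1se")
selected <- setdiff(rownames(b)[as.vector(b != 0)], "(Intercept)")
selected  # the predictor set to carry into the follow-up logistic model
```

The `selected` names would then be the candidate predictors for the unpenalized logistic fit in the second step.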

#### bryangoodrich

##### Probably A Mammal
So in step 1 you're going to search for a model using the training set. In step 2, you're going to use that model's covariates to fit the final model. Which data set is this going to be fit to? I'm thinking there should be a third data set for this purpose: one for training the model selection, one for fitting the final model, and one for testing the predictive accuracy (the final hold-out sample).

I'm not sure how to handle the penalization you're concerned with. In step 2 I could imagine doing any further regularization with this third data set (k-folds) to deal with it there, leaving a final hold-out sample for measuring predictive accuracy. Maybe split your current test set in 2? (60/20/20)
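A minimal sketch of the 60/20/20 partition being suggested, on simulated stand-in data (all names and sizes illustrative):

```r
set.seed(1)
# stand-in data frame: a rare binary outcome plus binary predictors
dat <- data.frame(y = rbinom(300, 1, 0.1),
                  matrix(rbinom(300 * 25, 1, 0.5), nrow = 300))

# exact 60/20/20 split: selection, model-fitting, and final hold-out sets
sizes <- c(select = 180, fit = 60, test = 60)
grp <- sample(rep(names(sizes), times = sizes))
splits <- split(dat, grp)
sapply(splits, nrow)
```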

#### hlsmith

##### Omega Contributor
Thanks for the feedback. The training set will be used for both the LASSO and the logistic model. I am not sure I need the third dataset, since I don't plan to perform any more model building/development after the LASSO; I am just using the final best subset of predictors from LASSO in a logistic model and then scoring the second set. If I had a third set, I would be applying the model to a new set and then scoring a third set, which may better depict the general accuracy, but I don't think it is necessary.

I know some people split datasets into training, testing, and validation sets, but what do they actually do with the testing set? I guess I could test the variables in it, but I don't plan to do any additional tweaking, and I feel that logic alone should hold for that component of my plan. A second issue, which I would call a limitation of the project size rather than a researcher degree of freedom, is that the outcome is rare. I can't recall offhand, but it may be 5-10%. So I think a second split could result in worse prediction due to over-parameterization given the rarity of the outcome.

But I am open for debate!

#### bryangoodrich

##### Probably A Mammal
The problem I see is that you're doing 2 different things with the same data set, where the 2nd step is dependent on the first step.

You're using LASSO for feature selection based on some data set that describes a phenomenon you want to build a predictor for. Great, but now that you've used that data set for feature selection, you wouldn't want to then use it for building your predictive model, which is what you're doing.

My point is use your 60% (or less) to k-fold CV a selected model that tells you which covariates among the possible covariates to include in your model. This simply informs which features to use in your predictive model. It is not, say, selecting which model to use among a number of fitted models, which is a different situation altogether.

Once you know which features you're going to use, you fit your predictive model. But you don't want to tune it on the data which informed which features were to be used in the model. This is where you can use another training data set (training the predictive model, not feature selection). Once you've fit that model, then you can validate it against a hold-out sample for testing error. Thus, you've cleanly separated the tasks of feature selection, model fitting, and validation, each using independent data sets to measure them. That make sense?

#### hlsmith

##### Omega Contributor
Yes, that makes sense, but I just want to skip the tuning of the prediction model and apply the features. The tuning of the prediction model just feels like overkill.

What about doing the feature selection with LASSO and then applying the features to the 40% holdout set? So skip the scoring using the features applied to the initial 60%.

#### rogojel

##### TS Contributor
hi,
couldn't you use a fundamentally different method for feature selection, like a classification tree or even a forest? Then use the features in building a logistic model?

regards

#### hlsmith

##### Omega Contributor
rogojel,

Thanks for the reply. That is primarily what I am doing, but using LASSO instead. I had thought of using a tree-based approach, but given the correlation in the data and trees' focus on splits (interactions), a penalization model seemed more apt for my data and purpose. I just didn't know if my above approach passed the general intuition test and made sense to others without setting off red flags.

#### Dason

Can you elaborate on this:
> My issue is that LASSO won't give me interpretable estimates for clinical use

#### hlsmith

##### Omega Contributor
Yes, I suppose I am referencing the lack of confidence intervals for inference.

#### bryangoodrich

##### Probably A Mammal
> Yes that makes sense, but I just want to skip the tuning of the prediction model and apply the features. The tuning of the prediction model just feels like overkill.
>
> What about doing the feature selection with LASSO and then applying the features to the 40% holdout set? So skip the scoring using the features applied to the initial 60%.

I wouldn't say it's overkill. It's eliminating any potential overfitting to that data since your choice of features was based on a given X and now you're fixing your estimates to the same X. Maybe it has no gain in the predictive accuracy as estimated by the test error, but I would avoid it. Let your estimates come from fitting to data that model hasn't been used on yet. Then test it against new data.

#### hlsmith

##### Omega Contributor
Yes, I heard Trevor Hastie say that getting SEs is a work in progress and that you can bootstrap.

Though, would that mean running an initial LASSO on the 60% to get features, then running a new LASSO model on the 40% and forcing LASSO to only use the predefined features, then bootstrapping this second model? That seems feasible enough. Though I haven't done bootstrapping in R for models before (writing the code myself, not cutting-and-pasting), so I would require assistance in that regard.
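One simpler route than re-running a constrained LASSO, sketched below with made-up data and variable names: fit the LASSO-selected features as an ordinary (unpenalized) logistic model on the holdout set and bootstrap its coefficients directly.

```r
set.seed(99)
# stand-in for the 40% holdout; x1 and x2 play the role of the selected features
holdout <- data.frame(y  = rbinom(120, 1, 0.3),
                      x1 = rbinom(120, 1, 0.5),
                      x2 = rbinom(120, 1, 0.5))
form <- y ~ x1 + x2  # the LASSO-selected predictor set

# resample rows with replacement, refit, and keep the coefficients
boot_coefs <- replicate(1000, {
  idx <- sample(nrow(holdout), replace = TRUE)
  coef(glm(form, family = binomial, data = holdout[idx, ]))
})

# percentile 95% intervals, one column per coefficient
ci <- apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))
ci
```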

#### bryangoodrich

##### Probably A Mammal
Bootstrapping is pretty easy. You can roll your own loop or use the boot package boot function. The main idea is that you take a random sample with replacement from your data set of the same size and run the model. Save the estimates. Repeat a bunch of times. Now you can look at the distribution of those estimates to see where the 95% cutoff is in the tails. There is your interval.

If the data is not too large (i.e., you can replicate the data set K times), then a simple approach is to pre-fetch all your samples. Or as I do here, just sample the row number indexes (note: you could just do the sampling in the fitting process itself instead of 2 steps as I do here. I like to separate tasks, personally).

Code:
```r
# draw one bootstrap sample of row indexes (the unused `i` lets lapply iterate)
index <- function(i, N) sample(1:N, N, replace = TRUE)
fit   <- function(idx, x) coef(lm(mpg ~ wt, data = x[idx, ]))

x <- mtcars
samples <- lapply(1:1000, index, N = nrow(x))
fits    <- lapply(samples, fit, x = x)

quantile(sapply(fits, "[", 1), probs = c(0.025, 0.975))  # Intercepts
quantile(sapply(fits, "[", 2), probs = c(0.025, 0.975))  # Beta1 (wt)
```
Here is an easy approach. I have 2 functions. The index function takes a size N and randomly samples, with replacement, a sequence of row numbers. We lapply to iterate 1000 bootstrap samples (indexes). Then we lapply over these sample indexes, passing the data set to our model-fitting function, which in this case returns the coefficients of the model. This is done K = 1000 times very naturally using lapply. We can then access the vector of fits for each coefficient using sapply and the accessor "[" function. Easy.

In my run of mtcars I get mpg = 37.285 - 5.344*wt

The mean of my bootstrap was 37.44509 and -5.41729, respectively with CIs (33.12, 42.51) and (-7.03, -4.18), respectively. Compare with confint results on that mtcars model (33.45, 41.12) and (-6.49, -4.20).

I'd read up on examples of how to use the boot function in R. It also computes a number of types of confidence intervals from your bootstrapping (I believe what I did above is equivalent to percentile bootstrapping ci?). See Table 11.9 onward here; I roll my own and then show the boot example: http://rpubs.com/bryangoodrich/5225

#### bryangoodrich

##### Probably A Mammal
Replicating what I did above using the boot package

Code:
```r
library(boot)
fit  <- function(data, index, formula, ...) coef(lm(formula, data = data[index, ]))
fits <- boot(mtcars, fit, formula = mpg ~ wt, R = 1000)

# Defaults to the first coefficient (index = 1), the intercept
boot.ci(fits, type = c("norm", "basic", "perc"))  # I get errors with type = "stud"
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 1000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = fits, type = c("norm", "basic", "perc"))
#
# Intervals :
# Level      Normal              Basic              Percentile
# 95%   (32.65, 41.73 )   (32.62, 41.64 )   (32.93, 41.95 )
# Calculations and Intervals on Original Scale

# For the slope coefficient (index = 2)
boot.ci(fits, type = c("norm", "basic", "perc"), index = 2)
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 1000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = fits, type = c("norm", "basic", "perc"), index = 2)
#
# Intervals :
# Level      Normal              Basic              Percentile
# 95%   (-6.668, -3.916 )   (-6.531, -3.717 )   (-6.972, -4.158 )
# Calculations and Intervals on Original Scale
```

#### hlsmith

##### Omega Contributor
Thanks, I will check this out later tonight; it should help. I am very familiar with the bootstrap in SAS, but I have only used it a couple of times in R, where a package called on the boot package and did all the work for me.

#### hlsmith

##### Omega Contributor
Alright, here is what I ended up doing. Please provide any feedback on whether this approach seemed wonky or not.

_________________________________________________________________

Study dataset had two binary outcomes of interest. I partitioned the set into a training and test (final) set (i.e., 60/40 percent split, since outcome was rare).

-Training set: used two LASSO models for feature selection for the two outcomes.

-Test (final) set: fit two Bayesian logistic models (with normal priors specified) to get estimates for the outcomes.
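The actual Bayesian fits here were done in SAS (per the methods post below in the thread), but an R analogue with normal priors could be sketched via `arm::bayesglm` (the prior settings and variable names below are illustrative, not the study's):

```r
library(arm)

set.seed(3)
# stand-in for the test (final) set with the LASSO-selected predictors
dat <- data.frame(y  = rbinom(120, 1, 0.3),
                  x1 = rbinom(120, 1, 0.5),
                  x2 = rbinom(120, 1, 0.5))

# prior.df = Inf turns bayesglm's default t priors into normal priors
fit <- bayesglm(y ~ x1 + x2, family = binomial, data = dat,
                prior.mean = 0, prior.scale = 2.5, prior.df = Inf)
exp(cbind(OR = coef(fit), confint.default(fit)))  # odds ratios with Wald intervals
```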

#### bryangoodrich

##### Probably A Mammal
Since it shouldn't be too much extra work, I'd also consider running a classification tree to compare what splitting variables it finds most important (a sort of feature selection); might be a good fit in this instance, since you're dealing with all binary variables. I'm no Bayesian wiz, but unless you have a reason for normal priors, I'd probably try a couple other priors to see how much things change (sensitivity analysis?).
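A quick sketch of that tree-based comparison, using rpart's variable-importance scores on simulated stand-in data (names and effect sizes illustrative):

```r
library(rpart)

set.seed(7)
# all-binary stand-in data where x1 genuinely drives the outcome
X <- matrix(rbinom(300 * 10, 1, 0.5), nrow = 300,
            dimnames = list(NULL, paste0("x", 1:10)))
y <- factor(rbinom(300, 1, plogis(-1 + 2 * X[, 1])))
dat <- data.frame(y, X)

tree <- rpart(y ~ ., data = dat, method = "class")
# predictors ranked by their contribution to the splits;
# this ranking can be compared against the LASSO-selected set
sort(tree$variable.importance, decreasing = TRUE)
```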

#### hlsmith

##### Omega Contributor
My Methods write up for this is below, let me know if anyone has some feedback. Thanks, and I know this is kind of a boring study, but a good initial platform for me to learn from.

METHODS

The primary study focus of patient experience was measured using Consumer Assessment of Healthcare Providers and Systems (CAHPS®) Clinician and Group surveys. Data came from surveys returned from January 2014 through April 2017 for patients seen at a pediatric hematology-oncology outpatient clinic located in a Midwestern city. The clinic was a mid-size semi-private practice with five full-time physicians (no fellows or mid-level providers), who saw approximately sixty to seventy new oncology patients per year, as well as a significant volume of patients with a variety of hematologic conditions. Other clinic staff included four nurses, two patient access associates, three research assistants, one patient care technician, two social workers, one child life specialist, one psychologist, one nurse coordinator, and one scheduler.

The CAHPS® instrument included 25 items scored on a 5-point Likert scale in addition to a “Rate This Provider” 0 to 10 point scale. The primary study objective was to determine predictors of a Top-Box score for “Rate This Provider” (defined as a score of 9 or 10) and predictors of “Likelihood of Your Recommending This Practice to Others” (defined as a score of 5, “Very Good”). Additional survey questions were also dichotomized based on a response of “Very Good” or not. Patient demographic information was not provided due to the anonymity of the surveying process, though the physician provider seen at the visit was available.

Analysis
Data were preliminarily reviewed to examine item response missingness and the need to control for patients nested within providers. Approximately 0% to 2% of individual question responses were missing, with no discernible patterns (e.g., missing response combinations or monotonicity); these data were classified as missing completely at random. Approximately 6% to 8% of data were missing for the two questions on visit delays and patient medications. The greater missingness rates for these questions were assumed to be related to respondents interpreting the questions as “not applicable” if there were no visit delays or patient medications. An empty multilevel model was used to examine the need to control for provider-level random effects. Model results revealed that the provider level did not explain a significant amount of response variability with regard to the study outcomes. As a result, analyses were based on complete-case data without the use of random effects.

Study data were partitioned into two random subsets based on a 60/40 split. The larger subset was used for predictor selection related to the two study outcomes using least absolute shrinkage and selection operator (LASSO) based on multiple logistic regression (glmnet package in R, Vienna Austria, see supplemental Figure Sa and Sb), due to possible multicollinearity in item responses and sparse outcome concerns. Predictors selected via LASSO were then used within the smaller dataset for estimate calculation using Bayesian logistic regression (PROC GENMOD, DIST=BIN, BAYES option in SAS 9.4, Cary NC). Models incorporated normal prior distributions (mean = 1; variance = 0.5) using multiple chain Monte Carlo (i.e., 3 chains, thinning = “3”) with 100,000 iterations and discarding first 50,000 as burn-ins. Model convergence was monitored and checked via traces plots, autocorrelation, posterior samples, and Gelman-Rubin statistics and Raftery-Louis diagnostics. Results were presented using odds ratios (OR) with 95% credible intervals (CI). A sensitivity analysis was conducted by rerunning the Bayesian analyses with non-informative prior distributions. Institutional Review Board approval was obtained for this study.