
Thread: Transporting LASSO Model Results to Logistic Reg for Estimates

  1. #1
    hlsmith




    I have a dataset which will be entirely made up of binary variables (i.e., ~25 IVs, 1 DV). Variables will likely be correlated. N=~300.


    -I was planning to split the dataset into training (60%) and test (40%) sets. Then I will run glmnet's lasso (in R) on the training dataset using k-fold CV and Friedman's one-SE rule for model selection.


    -Next, I was going to use the above model's final predictor set in an exact logistic regression model to get estimates.


    -Finally, I was going to score the test dataset using the logistic model coefficients.


    Does anyone see any harm in doing this? My issue is that LASSO won't give me interpretable estimates for clinical use, so I switch modeling approaches. I don't know if this would be frowned upon, since my final estimates won't be penalized, but they will be selected through a penalized process. Comments and feedback appreciated.
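For anyone following along, here is a rough sketch of the plan described above, not a definitive implementation. It runs on simulated stand-in data (the variable names, effect sizes, and seed are made up), and it uses an ordinary `glm` refit as a stand-in for the exact logistic regression step, which would need a separate package. The one-SE rule is assumed to correspond to glmnet's `lambda.1se`.

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 25
# Hypothetical data: 25 binary predictors, binary outcome driven by x1..x3
x <- matrix(rbinom(n * p, 1, 0.3), nrow = n,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- rbinom(n, 1, plogis(-2 + 1.2 * x[, 1] + 1.0 * x[, 2] - 1.0 * x[, 3]))

# 60/40 split into training and test rows
train <- sample(n, size = round(0.6 * n))

# Step 1: cross-validated lasso on the training rows only
cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial",
                   alpha = 1, nfolds = 10)
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")

# Step 2: unpenalized refit on the selected predictors (glm stands in for
# exact logistic regression); falls back to intercept-only if nothing is kept
dat <- data.frame(y = y, x)
form <- if (length(selected) > 0) reformulate(selected, response = "y") else y ~ 1
refit <- glm(form, data = dat[train, ], family = binomial)

# Odds ratios with Wald intervals; these ignore the selection step
exp(cbind(OR = coef(refit), confint.default(refit)))

# Step 3: score the held-out 40%
pred <- predict(refit, newdata = dat[-train, ], type = "response")
```

Note that the intervals from the refit do not account for the fact that the predictors were chosen on the same rows, which is the concern raised in the replies.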


    P.S., I am just planning this in my head for now. So I won't actually have time to run it for a couple of weeks. Also, I have read a little about adalasso, which I could incorporate into the first model building part (training) if it is currently available for use with a binary outcome.

  2. #2
    bryangoodrich

    So in step 1 you're going to search for a model using the training set. In step 2, you're going to use that model's covariates to fit the final model. Which data set is this going to be fit on? I'm thinking there should be a third data set for this purpose: one for training the model selection, one for fitting the final model, and one for testing the predictive accuracy (the final hold-out sample).

    I'm not sure how to handle the penalization issue you're concerned with. In step 2, I imagine you could do any further regularization with this third data set (k-folds) to deal with it there, leaving a final hold-out sample for measuring predictive accuracy. Maybe split your current test set in 2? (60/20/20)

  3. #3
    hlsmith

    Thanks for the feedback. The training set will be used for both the LASSO and the logistic model. I am not sure I need the third dataset, since I don't plan to perform any more model building/development after the LASSO; I am just using the final best subset of predictors from LASSO in a logistic model and then scoring the second set. If I had a third set, I would be fitting the model on a new set and then scoring a third set, which may better depict the general accuracy, but I don't think it is necessary.


    I know some people split datasets into Training, Testing, and Validation, but what do they actually do in the Testing set? I guess I could test the variables in it, but I don't plan to do any additional tweaking. I feel that logic alone should hold for that component of my plan. Secondarily, the outcome is rare, which I wouldn't call a researcher degree of freedom but a limitation of the project size. I can't recall offhand, but the prevalence may be 5-10%. So I think a second split could result in worse prediction due to over-parameterization, given how rare the outcome is.


    But I am open for debate!
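To put rough numbers on the rare-outcome concern above, here is a small back-of-the-envelope calculation. It assumes N = 300 exactly and a prevalence of either 5% or 10% (the thread only says "~300" and "may be 5-10%"), and it gives expected event counts, not what any particular random split would produce.

```r
n <- 300
prevalence <- c("5%" = 0.05, "10%" = 0.10)
expected_events <- n * prevalence

# Split fractions under discussion: 60/40 versus 60/20/20
two_way   <- c(train = 0.6, test = 0.4)
three_way <- c(select = 0.6, fit = 0.2, test = 0.2)

# Expected number of events landing in each partition
events_two   <- round(outer(expected_events, two_way), 1)
events_three <- round(outer(expected_events, three_way), 1)

print(events_two)
print(events_three)
```

At 5% prevalence, each 20% partition is expected to hold only about 3 events, which is very little to fit or validate a logistic model on.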

  4. #4
    bryangoodrich

    The problem I see is that you're doing 2 different things with the same data set, where the 2nd step is dependent on the first step.

    You're using LASSO for feature selection based on some data set that describes a phenomenon you want to build a predictor for. Great, but now that you used that data set for feature selection, you wouldn't want to then use it for building your predictive model, which is what you're doing.

    My point is use your 60% (or less) to k-fold CV a selected model that tells you which covariates among the possible covariates to include in your model. This simply informs which features to use in your predictive model. It is not, say, selecting which model to use among a number of fitted models, which is a different situation altogether.

    Once you know which features you're going to use, you fit your predictive model. But you don't want to tune it on the data which informed which features were to be used in the model. This is where you can use another training data set (training the predictive model, not feature selection). Once you've fit that model, then you can validate it against a hold-out sample for testing error. Thus, you've cleanly separated the tasks of feature selection, model fitting, and validation, each using independent data sets to measure them. That make sense?
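The separation described above could be set up along these lines. This is only a sketch: the partition names (`select`, `fit`, `test`), the 60/20/20 sizes for N = 300, and the seed are illustrative assumptions.

```r
set.seed(42)
n <- 300
# Assumed 60/20/20 partition sizes
sizes <- c(select = 180, fit = 60, test = 60)

# Randomly assign each row to exactly one role
role <- sample(rep(names(sizes), times = sizes))
idx <- split(seq_len(n), role)

# idx$select -> rows for LASSO feature selection
# idx$fit    -> rows for fitting the final model on the chosen features
# idx$test   -> hold-out rows for estimating test error
lengths(idx)
```

Because every row gets exactly one role, the three tasks never see each other's data.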

  5. #5
    hlsmith

    Yes that makes sense, but I just want to skip the tuning of the prediction model and apply the features. The tuning of the prediction model just feels like an overkill.

    What about doing the feature selection with LASSO and then applying the features to the 40% holdout set? So skip the scoring using the features applied to the initial 60%.

  6. #6
    rogojel

    hi,
    couldn't you use a fundamentally different method for feature selection, like a classification tree or even a forest? Then use the features in building a logistic model?
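A tree-based screen of the kind suggested here might look roughly like the following. It is a sketch on simulated data (five made-up binary predictors, with only `x1` truly related to the outcome), using `rpart` and its `variable.importance` component; a forest would need a different package.

```r
library(rpart)

set.seed(7)
n <- 300
# Hypothetical binary predictors x1..x5; the outcome depends on x1 only
x <- as.data.frame(matrix(rbinom(n * 5, 1, 0.5), nrow = n,
                          dimnames = list(NULL, paste0("x", 1:5))))
y <- factor(rbinom(n, 1, plogis(-1.5 + 3 * x$x1)))
dat <- data.frame(y = y, x)

# Classification tree; importance combines primary and surrogate splits
tree <- rpart(y ~ ., data = dat, method = "class")
imp <- tree$variable.importance
print(imp)

# Carry the top-ranked variable(s) into a logistic model
top <- names(imp)[which.max(imp)]
refit <- glm(reformulate(top, response = "y"), data = dat, family = binomial)
coef(refit)
```

The same caveat about reusing one data set for selection and fitting applies here as with LASSO.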

    regards

  7. #7
    hlsmith

    rogojel,


    Thanks for the reply. That is primarily what I am doing but using LASSO instead. I had thought of using a tree based approach, but due to the correlation in data and trees being focused on splits (interactions), a penalization model seemed more apt for my data and purpose. I just didn't know if my above approach seemed to pass the general intuition test and make sense to others without setting off red flags.

  8. #8
    Dason

    Can you elaborate on this:
    "My issue is that LASSO won't give me interpretable estimates for clinical use"

  9. #9
    hlsmith

    Yes, I suppose I am referencing the lack of confidence intervals for inference.

  10. #10
    bryangoodrich

    Originally posted by hlsmith:
    "Yes, I suppose I am referencing the lack of confidence intervals for inference."

    Bootstrap?

  11. #11
    bryangoodrich

    Originally posted by hlsmith:
    "Yes that makes sense, but I just want to skip the tuning of the prediction model and apply the features. The tuning of the prediction model just feels like an overkill.

    "What about doing the feature selection with LASSO and then applying the features to the 40% holdout set? So skip the scoring using the features applied to the initial 60%."

    I wouldn't say it's overkill. It's eliminating any potential overfitting to that data, since your choice of features was based on a given X and now you're fixing your estimates to the same X. Maybe it brings no gain in predictive accuracy as estimated by the test error, but I would still avoid reusing the data. Let your estimates come from fitting to data that the model hasn't been used on yet. Then test it against new data.

  12. #12
    hlsmith

    Yes, I heard Trevor Hastie say that getting SEs is a work in progress and that you can bootstrap.


    Though, would that mean running an initial LASSO on the 60% to get features, then running a new LASSO model on the 40%, forcing LASSO to use only the predefined features, and then bootstrapping this second model? That seems feasible enough. Though, I haven't done bootstrapping in R for models before (writing the code myself, not cutting-n-pasting), so I would require assistance in that regard.

  13. #13
    bryangoodrich

    Bootstrapping is pretty easy. You can roll your own loop or use the boot function from the boot package. The main idea is that you take a random sample with replacement from your data set, of the same size as the data set, and run the model. Save the estimates. Repeat a bunch of times. Now you can look at the distribution of those estimates to see where the 2.5% and 97.5% cutoffs are in the tails. There is your 95% interval.

    If the data is not too large (i.e., you can replicate the data set K times), then a simple approach is to pre-fetch all your samples. Or as I do here, just sample the row number indexes (note: you could just do the sampling in the fitting process itself instead of 2 steps as I do here. I like to separate tasks, personally).

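    Code: 
    # Draw one bootstrap sample of row indexes (i is only the iteration counter)
    index <- function(i, N) sample.int(N, N, replace = TRUE)
    # Refit the model on the resampled rows and return its coefficients
    fit <- function(idx, x) coef(lm(mpg ~ wt, data = x[idx, ]))
    
    x <- mtcars
    samples <- lapply(1:1000, index, N = nrow(x))
    fits <- lapply(samples, fit, x = x)
    
    # Percentile intervals from the bootstrap distribution of each coefficient
    quantile(sapply(fits, "[", 1), probs = c(0.025, 0.975))  # Intercept
    quantile(sapply(fits, "[", 2), probs = c(0.025, 0.975))  # Slope for wt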
    Here is an easy approach. I have 2 functions. The index function takes a size N and randomly samples with replacement a sequence of row numbers. We lapply to iterate 1000 bootstrap samples (indexes). Then we lapply these sample indexes passing the data set to our model fitting function, returning in this case the coefficients of the model. This is done the K=1000 times very naturally using lapply. We can then access the vector of fits for each coefficient using sapply and the accessor "[" function. Easy.

    In my run of mtcars I get mpg = 37.285 - 5.344*wt

    The means of my bootstrap estimates were 37.44509 and -5.41729, with CIs (33.12, 42.51) and (-7.03, -4.18), respectively. Compare with the confint results for that mtcars model: (33.45, 41.12) and (-6.49, -4.20).

    I'd read up on examples of how to use the boot function in R. It also computes a number of types of confidence intervals from your bootstrapping (I believe what I did above is equivalent to the percentile bootstrap CI?). See Table 11.9 onward here; I roll my own and then show the boot example: http://rpubs.com/bryangoodrich/5225
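Since the original question involves a binary outcome, here is how the same roll-your-own percentile bootstrap might carry over to a logistic model. This is a sketch under assumptions: the data are simulated, and the two predictors `x1` and `x2` are hypothetical stand-ins for whatever features were selected beforehand.

```r
set.seed(123)
n <- 300
# Hypothetical data: two binary predictors and a binary outcome
dat <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.5))
dat$y <- rbinom(n, 1, plogis(-1 + 1 * dat$x1 + 0.5 * dat$x2))

# Refit the logistic model on resampled rows and return its coefficients
fit <- function(idx, d) coef(glm(y ~ x1 + x2, data = d[idx, ], family = binomial))

# 1000 bootstrap samples of row indexes, drawn with replacement
samples <- lapply(1:1000, function(i) sample.int(n, n, replace = TRUE))
boots <- t(sapply(samples, fit, d = dat))  # one row per bootstrap replicate

# Percentile 95% intervals on the log-odds scale, then as odds ratios
ci <- apply(boots, 2, quantile, probs = c(0.025, 0.975))
print(ci)
print(exp(ci))
```

With a rare outcome and binary predictors, some resamples can end up with empty cells and unstable estimates, so it is worth checking the bootstrap distribution for extreme values before trusting the interval.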

  14. #14
    bryangoodrich

    Replicating what I did above using the boot package

    Code: 
    library(boot)
    fit <- function(data, index, formula, ...) coef(lm(formula, data = data[index, ]))
    fits <- boot(mtcars, fit, formula = mpg ~ wt, R = 1000)
    
    # Defaults to first variable (index=1)
    boot.ci(fits, type = c("norm", "basic", "perc"))  # I get errors with type = "stud"
    # BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    # Based on 1000 bootstrap replicates
    # 
    # CALL : 
    # boot.ci(boot.out = b, type = c("norm", "basic", "perc"))
    # 
    # Intervals : 
    # Level      Normal              Basic              Percentile     
    # 95%   (32.65, 41.73 )   (32.62, 41.64 )   (32.93, 41.95 )  
    # Calculations and Intervals on Original Scale
    
    # For the slope coefficient (index = 2)
    boot.ci(fits, type = c("norm", "basic", "perc"), index = 2)
    # BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    # Based on 1000 bootstrap replicates
    # 
    # CALL : 
    # boot.ci(boot.out = b, type = c("norm", "basic", "perc"), index = 2)
    # 
    # Intervals : 
    # Level      Normal              Basic              Percentile     
    # 95%   (-6.668, -3.916 )   (-6.531, -3.717 )   (-6.972, -4.158 )  
    # Calculations and Intervals on Original Scale

  15. The Following User Says Thank You to bryangoodrich For This Useful Post:

    hlsmith (06-16-2017)

  16. #15
    hlsmith


    Thanks, I will check this out later tonight; it should help. I am very familiar with the bootstrap in SAS, but I have only used it a couple of times in R, where a package called on the boot package and did all the work for me.