Lasso regression

#1
I am trying to learn this and I already have several questions :p

1) Is it still true that there is no (non-bootstrap) way of generating SEs for the lasso? If so, how do you do statistical tests?
2) One article I read said that all variables had to be standardized to use the bootstrap. Is that true?
3) I understand that lasso assigns a penalty to shrink various estimates. But I don't understand substantively which will shrink more than others (that is, the basis on which they shrink). I have to admit that although I know of penalties in regression, I don't understand how they work.

For example, one article says "The less important features of a dataset are penalized by the lasso regression. The coefficients of this dataset are made zero leading to their elimination. The dataset with high dimensions and correlation is well suited for lasso regression."

How does the regression decide in practice which features are less important and get minimized? (I know an algorithm is used, of course, and about regularization.)
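To partly answer my own question, my understanding is that the usual workhorse is coordinate descent with a soft-thresholding step: a predictor whose (standardized) correlation with the current residual is smaller in magnitude than the penalty gets set exactly to zero, and that is the sense in which "less important" features are eliminated. A rough pure-Python sketch on made-up data (the function names, data, and penalty value are all my own, just for illustration):

```python
import random

def soft_threshold(rho, lam):
    # Shrink toward zero by lam; anything inside [-lam, lam] becomes exactly 0.
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    # Naive coordinate descent for the lasso on standardized predictors.
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: current fit with predictor j left out.
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))  # corr. with residual
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Made-up data: x1 matters a lot, x2 only a little.
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [3 * a + 0.2 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]
X = [[a, b] for a, b in zip(x1, x2)]

beta = lasso_cd(X, y, lam=100.0)
# beta[0] is shrunk but stays in the model; beta[1] is exactly zero,
# because x2's correlation with the residual never clears the penalty.
```

So "importance" here is operational, not substantive: it is just whether a variable's association with what is left unexplained is big enough to beat the penalty.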
 
#2
Actually, while we are at it... when does correlation between variables become a problem for lasso? What counts as highly correlated (and is it correlation that matters, or multicollinearity)?

" When we apply Lasso regression to a model which has highly correlated variables, then it will retain only a few variables and sets other variables to be zero. That will lead to some loss of information as well as lower accuracy of the model."
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
LASSO serves to locate the 'regular' terms that show up as being associated with the DV. I would not recommend it if you didn't know the underlying associations among all the other variables, but in the case of survey data it is likely fine to get the job done.
 
#4
My concern is that all the predictors are measures of some dimension of satisfaction. So inherently some will be correlated.

Do you have a way to generate standard errors via the bootstrap?
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
Do you have a way to generate standard errors via the bootstrap?
What is this in regards to? In LASSO it is actually inappropriate to use the estimates from the LASSO for inference, since those data were used to select the model. The SEs are off, and I believe the bootstrap does not fix this. So you either need to correct them via a selective-inference procedure, or fit the selected model on a random holdout data set using basic logistic regression.
 
#6
I had read that this was an issue. But I think you are using this as a selection tool for the variables only. The authors I read were using the actual LASSO estimates as well, in order to deal with overfitting; that is not using it just as a selection tool.
 
#7
For the half of the data using lasso, roughly does this look right?

proc glmselect data=Simdata plots=all;
   partition fraction(validate=0.3);
   class c1 c2;
   model y = c1|c2|x1|x2|x3|x4|x5 @2 / selection=lasso(stop=none choose=validate);
run;

The MODEL statement requests that a linear model be built using all the effects (c1, c2, x1, x2, x3, x4, and x5) and their two-way interactions. The PARTITION statement randomly reserves 30% of the data as validation data and uses the remaining 70% as training data. The training set is used for fitting the models, and the validation set is used for estimating the prediction error for model selection.
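In case it helps to see the PARTITION / choose=validate logic outside SAS, here is a rough pure-Python analog, reduced to a single predictor so the lasso fit has a simple closed form. The data, penalty grid, and function names are all my own invention, not SAS internals:

```python
import random

def soft_threshold(rho, lam):
    # Shrink toward zero by lam; values inside [-lam, lam] become exactly 0.
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def fit_lasso_1d(x, y, lam):
    # Closed-form lasso for one standardized predictor (illustration only).
    rho = sum(a * b for a, b in zip(x, y))
    z = sum(a * a for a in x)
    return soft_threshold(rho, lam) / z

# Made-up data with a true slope of 2.
random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + random.gauss(0, 1) for a in x]

# PARTITION FRACTION(VALIDATE=0.3) analog: random 70/30 train/validation split.
idx = list(range(n))
random.shuffle(idx)
cut = int(0.7 * n)
train, valid = idx[:cut], idx[cut:]

def valid_mse(lam):
    # CHOOSE=VALIDATE analog: fit on the training rows only, then score
    # prediction error on the held-out validation rows.
    beta = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], lam)
    return sum((y[i] - beta * x[i]) ** 2 for i in valid) / len(valid)

# Try a grid of penalties and keep the one with the smallest validation error.
grid = [0.0, 5.0, 20.0, 80.0, 200.0]
best_lam = min(grid, key=valid_mse)
```

The point is that the validation set's only job is to pick the penalty; heavy penalties that crush a genuinely useful coefficient show up as worse validation error and lose.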

This is what results. So you would choose all these non-zero variables to run in the next step (in logistic regression in my case)?

1631809680647.png

The big concerns I have with this approach are that 1) I only have 440 useful cases, and only 44 at one level of the DV. Is that enough data to analyze if half of it is used for the lasso? (For what it is worth, there are only about 1000 people in the organization.)

2) I am also worried about correlated predictors. But if I understood what you said, this is not a big issue.
 

hlsmith

Less is more. Stay pure. Stay poor.
#8
Correct: if you are just trying to get a best subset, the data splitting may not be necessary, since you don't care about the empirically estimated standard errors. However, given the number of covariates, it may be beneficial to refit a model in a holdout set, get the estimates with confidence intervals, and plot them all to make a final decision about which variables seem to have a stronger association. I say this because you will likely have a few that look like they explain the same amount of signal, and looking at their estimates and precision together may help you make your final decision.
 

hlsmith

Less is more. Stay pure. Stay poor.
#9
I thought you always had the full population and sample size was not an issue. So that is not the case here?

I am not sure how SAS treats the partition portion; I would guess that it is using it to figure out what shrinkage penalty to use. If so, I would still use the holdout set. If not, that part may not be necessary.
 
#10
Usually we do. But every few years we run a survey, and in that case I have only about 40 percent of the population and far fewer cases. That is what I am doing now.

SAS uses part of the data to decide on the L1 penalty. Then it uses the rest to estimate the lasso.

The real issue is that I will probably have only about 20 cases out of 220 at one level of the DV. Hopefully that will be enough.

If I understand correctly, you would still split the data into one set to decide on the lasso and use the rest to estimate the logistic regression.
 

hlsmith

Less is more. Stay pure. Stay poor.
#11
Yeah, given the information you provided, that would be the recommendation. Even if you have sparse data, the process will help ensure nominal or spurious variables aren't sneaking in, since you will likely end up with just a few regular terms that are generalizable to the other 60% of unseen data, given they are similar.
 
#12
What worries me most is that I don't really understand the underlying logic the method uses to say this variable should be shrunk more and that one less, which determines which get thrown out (I assume certain variables are shrunk at a higher rate than others, although I have yet to see an author actually say that). I understand, sort of, what it does. But I don't understand the substance behind it, which makes me worry that the wrong variables will get tossed.

Amusingly I found this...

"Thirdly, any automatic method will be inadequate because (as noted above) there are considerations other than simple “model fit” to consider. Automatic methods cannot substitute for substantive knowledge and thought." So I guess the partial answer is that lasso looks at model fit. But what the substantive issues to consider are in practice is way beyond me.

Of course this is true of every statistic I run :) It would be nice to understand the math behind it, but I have found out painfully that it is beyond my math skills. I have never been able to learn calculus or linear algebra.
 
#13
The good news is that I ran a test of VIF, using proc reg since proc logistic does not do this, and none of my predictors had a VIF over 3.
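In case a concrete check of the arithmetic is useful: with just two predictors, the VIF reduces to 1/(1 − r²), so VIF < 3 corresponds to a pairwise correlation below about 0.82 in absolute value. A small pure-Python sketch (the function and data are mine, not SAS output):

```python
import math

def vif_two(x1, x2):
    # For two predictors the VIF is 1 / (1 - r^2), where r is their Pearson
    # correlation; with more predictors, r^2 generalizes to the R^2 from
    # regressing one predictor on all the others.
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r = cov / math.sqrt(v1 * v2)
    return 1.0 / (1.0 - r ** 2)

# These toy columns have r = 0.8, giving VIF = 1/(1 - 0.64), just under 3.
vif = vif_two([1, 2, 3, 4], [1, 3, 2, 4])
```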
 
#14
Never mind, I found this:

"Lasso shrinks all β values by the same amount."

But that confuses me. If that is true, it seems you could just rank-order the standardized variables (and any dummies, if you have those) and pick as many as you want for your model :p It can't be that simple.
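For what it is worth, my understanding is that the "same amount" claim is exact only in the special case of orthonormal (uncorrelated, standardized) predictors, where the lasso solution is just soft thresholding of the OLS coefficients. In that special case, the rank-ordering intuition is basically right; with correlated predictors there is no such closed form and the coefficients compete, so the order of elimination can change. A tiny sketch with made-up coefficients:

```python
def soft_threshold(b, lam):
    # Orthonormal-design lasso: each OLS coefficient moves toward zero by the
    # same amount lam, clipping at zero. Small coefficients hit zero first.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

ols = [2.5, 1.25, 0.5, -0.75]   # hypothetical standardized OLS estimates
lam = 1.0
lasso = [soft_threshold(b, lam) for b in ols]
# -> [1.5, 0.25, 0.0, 0.0]: everyone loses 1.0, the two smallest are dropped
```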
 
#15
I only have 450 cases, so my lasso step would use only about 220 of them, which some authors suggest is too small for the k-fold methods that seem central to lasso.

"Think a bit more about the issue of small N. If N is moderate (say N=200 observations) a fivefold split will create a training set with 160 observations and the training set is likely to approximate the next data set encountered. However a K=5 split produces a test data set that only has 40 observations. It is likely that this test set can differ from the data to which you wish to generalize. Forty is likely to be too small to be a good test set."

SAS's default is 10, not five :p
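The fold arithmetic in that quote is easy to check: with K folds, each fold takes one turn as the test set while the other K−1 folds are training data. A trivial sketch (exact fold assignment in SAS may differ when N is not divisible by K):

```python
def fold_sizes(n, k):
    # Each of the k folds serves once as the test set; the rest trains.
    test = n // k
    return n - test, test

# N = 200, K = 5: 160 training / 40 test, matching the quote above.
# N = 200, K = 10: 180 training / 20 test, so a larger K means an even
# smaller test set each round, which is exactly the small-N worry.
```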
 

hlsmith

Less is more. Stay pure. Stay poor.
#17
For the half of the data using lasso, roughly does this look right?

proc glmselect data=Simdata plots=all;
   partition fraction(validate=0.3);
   class c1 c2;
   model y = c1|c2|x1|x2|x3|x4|x5 @2 / selection=lasso(stop=none choose=validate);
run;

The MODEL statement requests that a linear model be built using all the effects (c1, c2, x1, x2, x3, x4, and x5) and their two-way interactions. The PARTITION statement randomly reserves 30% of the data as validation data and uses the remaining 70% as training data. The training set is used for fitting the models, and the validation set is used for estimating the prediction error for model selection.

This is what results. So you would choose all these non-zero variables to run in the next step (in logistic regression in my case)?

View attachment 3656

The big concerns I have with this approach are that 1) I only have 440 useful cases, and only 44 at one level of the DV. Is that enough data to analyze if half of it is used for the lasso? (For what it is worth, there are only about 1000 people in the organization.)

2) I am also worried about correlated predictors. But if I understood what you said, this is not a big issue.
Not sure you need the interaction terms blindly included. I would run it without interactions and just move the heavy hitters along to the holdout-set model.
 
#19
I find this statement confusing.

"CVMETHOD=RANDOM(10) option in the MODEL statement requests 10-fold cross validation where the training data is partitioned into five random subsets for example observations {1,6,11,. . . }, {2,7,12,. . . }, and so on. "

Why five and not ten subsets?