Mixing lasso with OLS


TS Contributor
I am looking at a data set of 45 continuous IVs and 1 DV, trying to find an acceptable regression model. I tried the lasso with cross-validation and ended up with about 10 non-zero coefficients in the best model. However, when I ran the sanity check of fitting an OLS model with the 10 non-zero IVs, half of them had very high p-values (about 0.7-0.8). I took those IVs out of the model and ended up with quite a reasonable final set.
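For concreteness, here is a minimal sketch of the two-step workflow I described. My actual work was in R with glmnet; the Python/sklearn functions and the simulated data below are just an illustration of the same idea:

```python
# Sketch of the workflow: lasso with CV for variable selection,
# then an OLS refit on the selected IVs. Simulated data, not my real set.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 45
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]          # only 5 truly active IVs
y = X @ beta + rng.normal(size=n)

lasso = LassoCV(cv=10).fit(X, y)        # lambda chosen by 10-fold CV
selected = np.flatnonzero(lasso.coef_)  # indices of non-zero coefficients
print("selected IVs:", selected)

ols = LinearRegression().fit(X[:, selected], y)  # OLS refit on survivors
print("OLS coefficients:", ols.coef_.round(2))
```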

My questions would be:

0. Does this make sense or is this approach completely stupid?

1. Are the lasso model and the OLS even comparable? I.e. is it reasonable to expect low p-values in the OLS regression for the IVs that have non-zero coefficients in the best lasso model?

1.a If not, how can I trust a model that has non-significant IVs in it?

2. Are there any arguments for not looking at the p-values at all and just going with the best lasso model?

2.a If my main interest is in finding parameters that might physically affect the outcome should I trust the lasso or the OLS (especially given the non-significant parameters in the lasso)?

Many thanks for any help; I am quite a novice at lasso regression but find it very interesting.

My answer to your questions is simply and frankly: I don't know!

Maybe you can gain from reading the free book by Hastie et al., The Elements of Statistical Learning.


On page 82 (and before that) they discuss model choice.

I would say that your procedure seems reasonable. I guess a lot of people would do what you have done.

But I doubt that the "p-values" can be interpreted as error rates: you have already sorted out a good candidate group from 45 variables, so the usual sampling assumptions behind the p-values no longer hold. Still, who would not use the p-values as description? (As description rather than as formal tests.)

How many variables are selected depends on the parameter "lambda", doesn't it? I guess your lambda was selected by cross-validation. A stricter (larger) value of lambda would leave out the "non-significant" variables, as I understand it.
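To illustrate that point: the sketch below uses sklearn's Lasso on simulated data (alpha plays the role of glmnet's lambda; the data and parameter values are my own choices). A larger penalty zeroes out more coefficients:

```python
# Larger penalty (alpha/lambda) -> sparser model. Simulated data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 45
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)  # only 2 real IVs

counts = []
for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    counts.append(int(np.count_nonzero(coef)))
    print(f"alpha={alpha}: {counts[-1]} non-zero coefficients")
```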


Probably A Mammal
If I understand your method correctly, you

1. Used cross-validation to fit a lasso regression that retained 10 of the 45 possible independent variables?

2. Then fit an OLS model using those 10 independent variables and got spurious results (high p-values) for half of them?

If that is correct, I'd have to ask how you did your cross-validation. How many folds did you use in your CV? How many observations do you have, and are they independent (e.g., this isn't time series)? What did you tune in your regularization: the penalty parameter, or the number/selection of independent variables?

I'd also add that I wouldn't necessarily expect an OLS fit derived from a regularization method to show a good fit. The point of regularization is to trade off bias against variance: you accept more bias in exchange for lower variance, and therefore better out-of-sample prediction. That's why you would use something like ridge or lasso instead of OLS.
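A quick sketch of that shrinkage (simulated data; sklearn's Lasso and LinearRegression are my choice of tools, not anything from the thread): the lasso estimates are pulled toward zero relative to OLS on the same data, and the lasso solution's L1 norm can never exceed the OLS solution's.

```python
# Lasso coefficients are shrunk (biased toward zero) relative to OLS.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)
print("OLS  :", ols.coef_.round(2))
print("lasso:", lasso.coef_.round(2))
```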


When using regularization for variable selection (dimension reduction), you can also consider alternative methods that may be better suited; the lasso is not always appropriate, as discussed here: http://www.stat.purdue.edu/~tlzhang/mathstat/ElasticNet.pdf (p. 2 of the PDF).
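For example, with highly correlated predictors the elastic net (which combines the L1 and L2 penalties) tends to keep the correlated group together rather than arbitrarily picking one of them, which is one of the motivations in that paper. A sketch using sklearn's ElasticNetCV (the simulated data and parameter choices here are mine):

```python
# Two nearly identical IVs carry the signal; 8 IVs are pure noise.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)                  # shared latent signal
x1 = z + 0.05 * rng.normal(size=n)      # two highly correlated IVs
x2 = z + 0.05 * rng.normal(size=n)
noise_ivs = rng.normal(size=(n, 8))     # 8 irrelevant IVs
X = np.column_stack([x1, x2, noise_ivs])
y = 2 * z + rng.normal(size=n)

# l1_ratio=0.5 mixes lasso and ridge penalties; alpha chosen by 5-fold CV
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print("coefficients:", enet.coef_.round(2))
```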


TS Contributor
Hi, thanks for your answers! I actually got the idea while reading Hastie :) but I guess my problem is a bit different. I need to optimize a complex machine, and for practical reasons we cannot do a DOE (design of experiments). So the idea is to build a statistical model, find an optimum in the model, try it, then improve or change the model and repeat.

So my primary interest in a model is to select a number of parameters that relate well to the physical outcome; that is why I am, maybe unjustly, wary of non-significant variables in the model.

BTW, I just used the most vanilla cross-validation: cv.glmnet with default settings, and the tuned parameter was lambda.



Probably A Mammal
I watched a presentation on using R to run an evolutionary algorithm that simultaneously tested a lot of parameters while adjusting its own mutation-rate parameter so that it would search for a local optimum. Maybe this problem would fall under that sort of situation? I'm not sure. The subject matter is new to me, but it's part of the Mahout library (for working on a Hadoop system; I was at a Hadoop conference!). The code is open source, though. If I can find the packages he used or a concrete example, I'll drop it here.