I am thinking of something like this: either build the model on a training set and validate it on a test set, or work backwards: find the model using all the data, but then require that there be some indication of the effect when using a smaller random subset of the original data?
Yes,
or turning it around: if smaller data sets show no indication of an effect, or contradictory indications, then it would be legitimate to think that the effect is just a fluke, I think.
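A rough sketch of that subset check, just to make the idea concrete. Everything here is made up for illustration: the data are simulated, the "effect" is a plain one-predictor regression slope, and the helper names `slope` and `sign_agreement` are hypothetical.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of y on a single predictor x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def sign_agreement(xs, ys, subset_size, n_subsets=200, seed=0):
    """Fraction of random subsets whose slope has the same sign as the full-data slope."""
    rng = random.Random(seed)
    full = slope(xs, ys)
    idx = list(range(len(xs)))
    agree = 0
    for _ in range(n_subsets):
        pick = rng.sample(idx, subset_size)
        s = slope([xs[i] for i in pick], [ys[i] for i in pick])
        if (s > 0) == (full > 0):
            agree += 1
    return agree / n_subsets

# Simulated data (assumption): one response with a real effect, one with none.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y_real = [2.0 * xi + rng.gauss(0, 3) for xi in x]
y_noise = [rng.gauss(0, 3) for _ in x]

real_agreement = sign_agreement(x, y_real, subset_size=40)
noise_agreement = sign_agreement(x, y_noise, subset_size=40)
print(real_agreement, noise_agreement)
```

A genuine effect should keep its sign in essentially every subset, while a fluke flips back and forth, which is the "contradictory indications" case.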
regards
So,
a bit late, due to the year-end hassle, but still keeping at it: I am now working on chap. 6, regression and especially model selection, ridge and lasso.
I just finished ex. 8, where I had to generate a random X and a Y that was a degree-3 polynomial function of X, plus noise of course. Then I had to generate the powers of X up to 10 and try to find a regression model that correctly describes the X-Y relationship.
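For anyone who wants to play along without R, the data setup can be sketched in a few lines of Python. The coefficients, sample size and noise level below are my own assumptions, not the values used in the exercise, and `simulate_cubic` is a hypothetical helper name.

```python
import random

def simulate_cubic(n=100, noise_sd=1.0, seed=0, coefs=(1.0, 2.0, -1.5, 0.5)):
    """Simulate y = b0 + b1*x + b2*x^2 + b3*x^3 + noise, plus the powers x^1..x^10.

    The coefficient values are arbitrary placeholders (assumption).
    """
    rng = random.Random(seed)
    b0, b1, b2, b3 = coefs
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [b0 + b1 * xi + b2 * xi ** 2 + b3 * xi ** 3 + rng.gauss(0, noise_sd)
         for xi in x]
    # One row per observation, columns are x^1 ... x^10 (the candidate predictors).
    powers = [[xi ** k for k in range(1, 11)] for xi in x]
    return x, y, powers

x, y, powers = simulate_cubic()
print(len(powers), len(powers[0]))
```

Only the first three columns of `powers` carry signal; the other seven are the decoys the selection method has to reject.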
My first surprise was that the regsubsets function from the leaps package did a pretty good job of identifying the model with 3 variables. I tried three selection criteria: Cp, adjusted R-squared and BIC. If I went strictly for the best value of each criterion (the minimum for Cp and BIC, the maximum for adjusted R-squared), then only BIC picked the right model, but if I went for the "knee" in the plot, then all three clearly identified the model with 3 parameters as the best one.
Using the lasso, the "best" model found by cross-validation also had 3 parameters, but only if I picked lambda.1se rather than lambda.min, which was my intuitive choice anyway.
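As I understand the two choices, lambda.min is the lambda with the lowest cross-validated error, and lambda.1se is the largest lambda whose error is still within one standard error of that minimum. A minimal sketch of the rule, with invented CV numbers and a hypothetical helper `one_se_choice`:

```python
def one_se_choice(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda CV error means and standard errors."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Every lambda whose CV error is within one standard error of the minimum.
    candidates = [lam for lam, m in zip(lambdas, cv_mean) if m <= threshold]
    # Larger lambda means more shrinkage, i.e. the simpler model.
    return lambdas[i_min], max(candidates)

# Invented cross-validation results (assumption, not from a real fit).
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
cv_mean = [1.20, 1.05, 1.08, 1.60, 3.00]
cv_se = [0.06, 0.05, 0.05, 0.08, 0.20]

lam_min, lam_1se = one_se_choice(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

With these numbers the rule moves from 0.01 to 0.1, trading a statistically indistinguishable CV error for heavier shrinkage, which would explain why it lands on fewer variables.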
As I knew the true parameter values, I could also compare the lm estimates to those of the lasso, and interestingly the lm model was somewhat better. Also, comparing the MSE on a new set of similarly generated data, lm performed better.
So I repeated the exercise with a lot more noise added. In this case the lasso's MSE was closer to that of the lm, but the simple lm model was still better.
regards
I am jealous, I am still a little too busy and lazy to commit. The LASSO seems to outperform when it is more of a p > n scenario, I believe, and when the variables may be correlated. And as you know, the CV helps more with overfitting and out-of-sample application.
I got slowed down by work but still have the ambition to continue, so here is the last exercise for chapter 6: predicting the crime rate in Boston, using the Boston dataset from the MASS library.
The task is to fit all the models that were developed in the chapter. I drew a random sample of 100 data points for testing and left 406 for training.
The first thing I learned is that in the presence of some outliers the test-set performance of the models can be hugely variable. For the exact same model, depending on the test set, I could get an MSE of 100 or of 10. The effect depended on whether some outliers got into the test set or not. Of course, an outlier in the test set had no influence on the fitted model but generated a large residual.
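The same effect is easy to reproduce on simulated data. The sketch below is not the Boston analysis: it uses a made-up one-predictor dataset of the same size (506 rows, 100 held out) with five planted outliers, and a hypothetical helper `fit_line`.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(0)
n = 506
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
outliers = set(rng.sample(range(n), 5))
for i in outliers:
    y[i] += 50.0  # planted outliers (assumption)

results = []  # (number of outliers in the test set, test MSE) per random split
for _ in range(50):
    idx = list(range(n))
    rng.shuffle(idx)
    test, train = idx[:100], idx[100:]
    a, b = fit_line([x[i] for i in train], [y[i] for i in train])
    test_mse = sum((y[i] - (a + b * x[i])) ** 2 for i in test) / len(test)
    results.append((len(outliers & set(test)), test_mse))

for n_out, test_mse in sorted(results)[:3] + sorted(results)[-3:]:
    print(n_out, round(test_mse, 2))
```

Grouping the test MSE by the number of outliers that landed in the test set shows that a single split says more about the luck of the draw than about the model, so repeated splits or cross-validation give a fairer comparison.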
So, comparing the methods: again the simple regression (with interactions) performed on average better than either the lasso or the ridge regression. PCR was somewhere in between the regression and the lasso, while PLS got very close to the simple regression. Given how much more difficult it would be to explain a PLS model than a regression, the simple regression still seems to be the winner, but the number of variables was really not high enough to see the advantages of the more sophisticated methods.
Another point: it does make sense to include nonlinearities and interactions in the models. This is easy with a simple regression; for all the others I just added product columns to the dataset (I could try squares as well). The ranking of the models did not change, but the MSEs went down for all of them.
Also, the outliers complicate the modelling a lot, so exploratory analysis is a must before any modelling. This does not seem to be a great discovery, but one tends to forget it in the heat of a project.
So, on to chapter 7...
Chapter 7 is about dealing with nonlinearities - building multiple regression models with polynomials, splines, step functions etc.
Here, my biggest difficulty is picking a suitable model: is a spline better than a polynomial, or maybe I should use a step function? Ex. 7 is about modelling the mileage of cars from a dataset of about 400 cars. There is a clear nonlinearity in the dependency for most parameters.
Trying to pick the right model I first tried to use spline functions and the results were great as far as the R-squared values and the p values were concerned (great as in R-sq of 0.99 and p values of 0.001, roughly). So, the next question was the number of knots to pick.
To cut a long story short, I went with the basic idea of using cross-validation to pick the number of knots, expecting bad performance both for too few knots (too stiff a model) and for too many (overfitting). My big surprise was that the spline models completely failed to produce any pattern with an increasing number of knots. It is possible that I made a mistake somewhere, of course.
As a counter-example I re-ran the cross-validation with loess, varying the span as a substitute for the number of knots, and I could find a region of the parameter that was about a factor of 5 better at prediction than any spline model I came up with. I could also clearly see a pattern: for very flexible models my RMSE was about the same as that of the spline models, while stiffening the model brought a large increase in the predicted RMSE.
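The span search can be sketched roughly as follows. This is not R's loess: it is a stripped-down local linear smoother with tricube weights and no robustness iterations, run on simulated data rather than the car dataset, and the names `local_linear` and `cv_rmse` are hypothetical.

```python
import math
import random

def local_linear(x_train, y_train, x0, span):
    """Tricube-weighted straight-line fit at x0 using the nearest span*n points."""
    n = len(x_train)
    k = min(n, max(3, math.ceil(span * n)))
    # Bandwidth = distance to the k-th nearest training point.
    h = max(sorted(abs(xi - x0) for xi in x_train)[k - 1], 1e-12)
    w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x_train]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x_train)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y_train)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x_train))
    if sxx == 0:
        return my
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x_train, y_train))
    return my + (sxy / sxx) * (x0 - mx)

def cv_rmse(x, y, span, n_folds=5, seed=0):
    """k-fold cross-validated RMSE of the smoother for one span value."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    sq_err = 0.0
    for f in range(n_folds):
        test = set(idx[f::n_folds])
        xtr = [x[i] for i in idx if i not in test]
        ytr = [y[i] for i in idx if i not in test]
        for i in test:
            sq_err += (y[i] - local_linear(xtr, ytr, x[i], span)) ** 2
    return math.sqrt(sq_err / len(x))

# Simulated nonlinear data (assumption).
rng = random.Random(42)
x = [rng.uniform(0, 2 * math.pi) for _ in range(150)]
y = [math.sin(2 * xi) + rng.gauss(0, 0.2) for xi in x]

spans = [0.05, 0.1, 0.2, 0.4, 0.7, 1.0]
cv_by_span = {s: cv_rmse(x, y, s) for s in spans}
best_span = min(cv_by_span, key=cv_by_span.get)
print(best_span, {s: round(v, 3) for s, v in cv_by_span.items()})
```

Using the same folds for every span (fixed seed) keeps the comparison between spans fair, which makes it easier to see whether the cross-validated RMSE really has a pattern or is just noise.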
This fits nicely with a discussion about the uses of the R-squared metric. It seems to me that as far as predictive quality goes, the R-squared is as good as useless. If I had stayed with the model with the best R-squared, I would never have tried the alternatives to an arbitrary spline model, though the loess can be five times as good at prediction.
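A small illustration of why the in-sample R-squared says so little about prediction, using the simplest flexible model from the chapter, a step function. The data are simulated and the helper names (`bin_index`, `fit_step`, `r_squared`) are hypothetical. The bin counts are powers of two so that each partition refines the previous one.

```python
import math
import random

def bin_index(x, lo, hi, n_bins):
    """Index of the equal-width bin on [lo, hi] that contains x."""
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)

def fit_step(x, y, lo, hi, n_bins):
    """Step-function fit: the mean of y in each bin (overall mean for empty bins)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x, y):
        b = bin_index(xi, lo, hi, n_bins)
        sums[b] += yi
        counts[b] += 1
    overall = sum(y) / len(y)
    return [s / c if c else overall for s, c in zip(sums, counts)]

def r_squared(x, y, means, lo, hi):
    """R-squared of the fitted bin means on the data (x, y)."""
    my = sum(y) / len(y)
    rss = sum((yi - means[bin_index(xi, lo, hi, len(means))]) ** 2
              for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1 - rss / tss

def simulate(n, rng, lo, hi):
    x = [rng.uniform(lo, hi) for _ in range(n)]
    y = [math.sin(2 * xi) + rng.gauss(0, 0.5) for xi in x]
    return x, y

rng = random.Random(7)
lo, hi = 0.0, 2 * math.pi
x_train, y_train = simulate(100, rng, lo, hi)
x_test, y_test = simulate(100, rng, lo, hi)

bin_counts = [2, 4, 8, 16, 32, 64]
train_r2, test_r2 = [], []
for n_bins in bin_counts:
    means = fit_step(x_train, y_train, lo, hi, n_bins)
    train_r2.append(r_squared(x_train, y_train, means, lo, hi))
    test_r2.append(r_squared(x_test, y_test, means, lo, hi))

for n_bins, tr, te in zip(bin_counts, train_r2, test_r2):
    print(n_bins, round(tr, 3), round(te, 3))
```

The training R-squared can only go up as the bins get finer, so picking the model by that number always favours the most flexible fit; the held-out column is the one that says anything about prediction.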