Bookclub: ISLR

rogojel

TS Contributor
#1
Hi,
I just pledged to work my way through this book (Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani). Anyone willing to join? Keeping the discipline would be more fun if we worked together.

regards
 

Junes

New Member
#2
I'm in! I already did the first 4 chapters of this book during an intensive two-week course, then I got sick and kind of scrambled to keep up, mostly sticking with lecture notes and slides. So I would be up for re-reading the first chapters and finishing the rest.

It's a pretty good book, by the way. Just in the sweet spot between accessible and rigorous, for me at least.

For anyone reading this, the book is freely and legally available as a PDF.
 

hlsmith

Omega Contributor
#3
I have read portions, and may be interested, but I need goals and dialogue between us. In 20 days my schedule opens up and I'm gonna do all kinds of reading.

Interestingly enough, I bought the new Hastie and Efron book last night.
 
#4
rogojel said: "I just pledged to work my way through this book (Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani). Anyone willing to join?"

Sounds nice, but I don't think I can commit fully to that. But it would be interesting to listen in on, and maybe participate in, the discussions.

Aha, so ISLR means "An Introduction to Statistical Learning with Applications in R". I guess that Hastie and Tibshirani made the right decision to skip 'estimation' and call it 'learning', and by that include the machine-learning people.


hlsmith said: "Interestingly enough, I bought the new Hastie and Efron book last night."

It seems like hlsmith is referring to this book.

I am glancing at a book by McElreath, "Statistical Rethinking: A Bayesian Course with Examples in R and Stan", which starts at an elementary level but is interesting.

I am also looking into Schweder and Hjort's "Confidence, Likelihood, Probability". A preliminary version of that book can be found here.

It seems like that book covers some of the same areas as Efron and Hastie do.
 

hlsmith

Omega Contributor
#5
GG, I have thought about the Statistical Rethinking book. Every time I go to Amazon its recommender suggests it to me. They also recommended it during the Gelman lecture earlier this week.
http://stan.fit/2016/10/27/intro-to-bayes-webinar


I have also been eyeing the ESL book, but figure you are supposed to read their first book, ISLR, beforehand. I will let you look that one up, GG.


Lastly, I keep wanting to read this book in its entirety; they are working on a follow-up book to it.
http://www.targetedlearningbook.com/


PS, I also need to read Gelman's multilevel book, and I need to finally read a longitudinal book. More for leisure, I am planning to read The Black Swan at the end of the month.
 

rogojel

TS Contributor
#8
So,
my first musing/question about the book: in the chapter on classification we discuss linear and quadratic discriminant analysis. It is also said that for more than two categories LDA is preferred over logistic regression. However, there was no significance calculation given for LDA, and there is none in the R output either. Also, I do not recall anyone ever recommending LDA in this forum instead of a logistic regression, not to mention recommending QDA. Is this because we can have no p-values? How would one calculate the sample size?
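For anyone who wants to see what I mean, here is a minimal sketch on the book's Smarket data (my own illustration, not from the text): summary() on a glm fit reports coefficients with standard errors and p-values, while the lda() object only reports prior probabilities, group means and the discriminant coefficients.

library(ISLR)   # Smarket data used in the chapter 4 lab
library(MASS)   # lda()

# Logistic regression: summary() gives coefficient estimates with p-values
glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial)
summary(glm.fit)

# LDA: the fitted object only shows priors, group means and LD coefficients
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket)
lda.fit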
 

rogojel

TS Contributor
#9
I just played with Exercise 10 from Chapter 4 – predicting whether the stock exchange would go up or down using weekly data. I got an interesting surprise – the exercise proposed that I split the training and the test data according to the year: everything before 2009 was training data, and 2009 through 2010 was test data. It also proposed to use only Lag2 as a predictor, out of the 5 available lags and the trade volume. I was not sure about splitting according to time – my suspicion was that if there was a trend or any other time-related pattern, it might not be captured in the test set in the same way as in the training set, leading to a biased performance estimate.

So, I did the logistic regression in 3 cases: with all the available data I got Lag2 as the significant predictor. However, if I ran the logistic regression on the training data alone, then Lag1 was significant and Lag2 was not. I guess this means that probably neither of them is significant, and all we see is some fluke in the data. I then decided to take a completely random selection of 800 points as the training data – and sure enough there were no significant predictors there.

Now, apart from the true objective of the exercise, this raises interesting questions about our use of regression and model selection. I would have accepted either Lag1 or Lag2 as a legitimate predictor in any analysis, and I guess anyone else would have accepted them as well. Given the recent discussions on the value of the p-value as a tool, this is quite sobering. Maybe one could extend p-value testing to require that train and test samples be used as well? I am thinking of something like: either build the model using a training set and validate it using a test set – or work backwards, find the model using all the data, but then require that there should be some indication of the effect if we used a smaller random subset of the original data.

BTW, of the possible choices for a classification algorithm, logistic regression with a threshold of 0.5 behaved very poorly, and LDA was only marginally better. Both algorithms essentially bet on an upward movement – the logistic regression only predicted 7 downward movements out of a total of 289. Because there were more upward movements than downward ones, this got them a true positive rate of around 51-52%. Surprisingly, QDA got a whopping 58.5%, with KNN at k=1 being as bad as logistic regression, but k=5 slightly better than logistic regression and QDA. I actually never had QDA on my radar; I guess this will change now.
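In case anyone wants to reproduce the comparison, here is roughly how I would set it up (my own sketch, not the book's solution code; the year-based split and the Lag2-only model follow the exercise):

library(ISLR)    # Weekly data
library(MASS)    # lda(), qda()
library(class)   # knn()

train <- Weekly$Year < 2009          # pre-2009 weeks for training, 2009-2010 for testing
test  <- Weekly[!train, ]

# Logistic regression with Lag2 only, classifying with a 0.5 threshold
glm.fit  <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.pred <- ifelse(predict(glm.fit, test, type = "response") > 0.5, "Up", "Down")
table(glm.pred, test$Direction)

# LDA and QDA with the same single predictor
lda.pred <- predict(lda(Direction ~ Lag2, data = Weekly, subset = train), test)$class
qda.pred <- predict(qda(Direction ~ Lag2, data = Weekly, subset = train), test)$class

# KNN with k = 1 and k = 5 (class::knn wants matrices)
knn1 <- knn(as.matrix(Weekly$Lag2[train]), as.matrix(test$Lag2),
            Weekly$Direction[train], k = 1)
knn5 <- knn(as.matrix(Weekly$Lag2[train]), as.matrix(test$Lag2),
            Weekly$Direction[train], k = 5)

# Fraction of correct test-set predictions for each model
mean(glm.pred == test$Direction)
mean(lda.pred == test$Direction)
mean(qda.pred == test$Direction)
mean(knn1 == test$Direction)
mean(knn5 == test$Direction)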
 

hlsmith

Omega Contributor
#10
rogojel, slow down. My schedule doesn't open up until late next week - then I will be all over this with you.


Yes, these authors highly champion the use of cross-validation if the sample is moderately large, training/test/validation sets for larger data, or leave-one-out for small samples. My problem is that I usually don't get large datasets.
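For reference, the resampling machinery the book uses later (the chapter 5 lab) is the boot package; a minimal sketch on the Auto data:

library(ISLR)   # Auto data
library(boot)   # cv.glm()

glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glm.fit)$delta            # leave-one-out CV estimate of the test error (slow-ish)
cv.glm(Auto, glm.fit, K = 10)$delta    # 10-fold CV estimate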


Did you do any model scoring? That is something that I have not really done.


Not too familiar with lags. Were they calling the older data the lag? I always think of it as the run-in data, so say the prior 3 days or something like that in panel data. Am I right in thinking this?
 

Junes

New Member
#11
Just wanted to let you guys know I'm still interested, I just don't have the time to look at this in detail right now.

But over the weekend I will have more time. I also remember doing this exercise, so I will have a look at it again.
 

rogojel

TS Contributor
#12
:) the advantages of a frequent traveller – since I have a Surface, I read the book and use R on the plane. It definitely accelerates things.

@hlsmith: the data is a time series and the lags are simply the data shifted by one, two, ..., five periods.
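Concretely, something like this (a toy illustration using the Weekly returns, not code from the book):

library(ISLR)                     # Weekly data
ret  <- Weekly$Today              # this week's return
lag1 <- c(NA, head(ret, -1))      # the same series shifted by one week
lag2 <- c(NA, NA, head(ret, -2))  # shifted by two weeks, and so on up to five
head(cbind(ret, lag1, lag2))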

I only rated the models based on the true positive rate – and sort of based on aesthetics as well, where I regard a well-balanced confusion matrix as more satisfying than one that only gets one category right.
 

rogojel

TS Contributor
#13
Last exercise for classification: the Boston dataset from the MASS library. The goal is to predict which districts will have an above-median crime rate based on all sorts of descriptive data. Since I just learned about LDA, this is what I tried first.

Vanilla attempt – on a training set of 80% randomly selected data: 87% TPR, but not nice at all; the model basically just decided to always pick TRUE. It did not get the FALSEs at all, but I had mostly TRUEs in the test dataset, so..

Trying the QDA method, and voilà: 93% TPR and a well-balanced confusion matrix.

My problem with LDA is that it does not give a p-value or any clue as to which variables are important in the model and which aren't – so I just decided to prune the model based on the group means reported in the output, on the basis of "large difference stays, small difference goes".

Group means:
             zn     indus       chas       nox       rm      age      dis       rad      tax  ptratio    black     lstat     medv
FALSE 22.882353  6.406639 0.05462185 0.4635324 6.421639 49.58824 5.270933  4.189076 296.1261 17.76891 388.8326  9.114664 25.40756
TRUE   1.831325 13.940723 0.13253012 0.6295120 6.206181 85.70301 2.601809 10.518072 434.3253 18.39518 367.2677 14.566928 22.40964
The new pruned LDA model improved to about 90% – the QDA just stayed where it was. So, in conclusion, there might be a way to improve an LDA model, but generally QDA will perform a bit better, for the price of more variance, I guess.

So, trying the other standard methods – logistic regression performed about as well as LDA, but pruning based on p-values reduced the model performance a lot.

KNN was almost as good as the QDA method (almost). Interestingly, increasing k from 1 to 5 did not improve the model at all. I really expected that it would, but apparently k=1 was already capturing all the structure in the data.
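In case it helps anyone reproduce this, here is roughly how I set it up (a sketch from memory, not the book's solution; the 80/20 split and the pruned variable list are my own choices):

library(MASS)    # Boston data, lda(), qda()
library(class)   # knn()

Boston$crim01 <- factor(Boston$crim > median(Boston$crim))  # TRUE = above-median crime rate

set.seed(1)
train.idx <- sample(nrow(Boston), size = round(0.8 * nrow(Boston)))
train <- Boston[train.idx, ]
test  <- Boston[-train.idx, ]

# LDA and QDA on all predictors except crim itself
lda.fit <- lda(crim01 ~ . - crim, data = train)
qda.fit <- qda(crim01 ~ . - crim, data = train)
mean(predict(lda.fit, test)$class == test$crim01)
mean(predict(qda.fit, test)$class == test$crim01)

# A "pruned" LDA keeping only variables with large group-mean differences
# (exactly which variables to keep is a judgment call)
lda.pruned <- lda(crim01 ~ zn + indus + nox + age + dis + rad + tax, data = train)
mean(predict(lda.pruned, test)$class == test$crim01)

# Logistic regression and KNN for comparison
glm.fit  <- glm(crim01 ~ . - crim, data = train, family = binomial)
glm.pred <- ifelse(predict(glm.fit, test, type = "response") > 0.5, "TRUE", "FALSE")
mean(glm.pred == test$crim01)

vars <- c("zn", "indus", "nox", "age", "dis", "rad", "tax")
mu   <- colMeans(train[, vars]); sdev <- apply(train[, vars], 2, sd)
knn.pred <- knn(scale(train[, vars], mu, sdev), scale(test[, vars], mu, sdev),
                train$crim01, k = 1)
mean(knn.pred == test$crim01)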
 

hlsmith

Omega Contributor
#14
So you are still on the classification chapter. I may read it over the next couple of days. I had been working on learning interrupted time series, which seems very straightforward as long as there aren't any time-varying confounders or observation-level confounders.
 

rogojel

TS Contributor
#15
Yepp, had two rather heavy weeks, but now I am on a kind of vacation with lots of free time – already reading the resampling chapter.
 
#16
rogojel said: "I am thinking of something like: either build the model using a training set and validate it using a test set – or work backwards, find the model using all the data, but then require that there should be some indication of the effect if we used a smaller random subset of the original data."
 

rogojel

TS Contributor
#17
Yes,
or turning it around: if there is no indication, or there are contradictory indications, of an effect in smaller datasets, then it would be legitimate to think that the effect is just a fluke, I think.

regards
 

rogojel

TS Contributor
#18
So,
a bit late, due to the year-end hassle, but still keeping at it – I am now working on chapter 6 – regression, and especially model selection, ridge and the lasso.

I just finished ex. 8, where I had to generate a random X and a Y that was a polynomial function of X of degree 3, plus noise of course, then generate the powers of X up to 10 and try to find a regression model correctly describing the X-Y relationship.

My first surprise was that the regsubsets function from the leaps package did a pretty good job identifying the model with 3 variables. I tried three selection criteria: Cp, adjusted R-squared and BIC. If I went for the optimum (minimum Cp and BIC, maximum adjusted R-squared), then only BIC picked the right model, but if I went for the "knee" in the graphical representation, then all three were obviously identifying the model with 3 parameters as the best one.

Using the lasso, the "best" model found by cross-validation also identified 3 parameters, but only if I picked lambda.1se and not lambda.min – which was my intuitive choice anyway.

As I knew the parameter values, I could also compare the lm model's estimates to those of the lasso – and interestingly the lm model was somewhat better. Comparing the MSE on a new set of similarly generated data, lm also performed better.

So, I repeated the exercise adding a lot more noise. In this case the performance of the lasso, MSE-wise, was closer to that of the lm, but the simple lm model was still better.
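For reference, a sketch of how I set this up (my own reconstruction; the true coefficients below are just example values, the exercise lets you pick your own):

library(leaps)   # regsubsets()
library(glmnet)  # cv.glmnet()

set.seed(1)
x   <- rnorm(100)
eps <- rnorm(100)
y   <- 2 + 3 * x - 1 * x^2 + 0.5 * x^3 + eps        # degree-3 polynomial plus noise

dat <- data.frame(y = y, poly(x, 10, raw = TRUE))    # columns X1 ... X10 = powers of x

# Best-subset selection: compare Cp, adjusted R-squared and BIC
fit.sub <- regsubsets(y ~ ., data = dat, nvmax = 10)
s <- summary(fit.sub)
which.min(s$cp); which.max(s$adjr2); which.min(s$bic)

# Lasso with cross-validation; compare the lambda.min and lambda.1se solutions
X  <- as.matrix(dat[, -1])
cv <- cv.glmnet(X, y, alpha = 1)
coef(cv, s = "lambda.min")
coef(cv, s = "lambda.1se")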

regards
 

hlsmith

Omega Contributor
#19
I am jealous; I am still a little too busy and lazy to commit. The lasso seems to outperform when it is more of a p > n scenario, I believe, and when variables may be correlated. And as you know, the CV helps more with overfitting and out-of-sample application.
 

rogojel

TS Contributor
#20
I got slowed down by work but still have the ambition to continue – so, the last exercise for chapter 6: predicting the crime rate in Boston, using the Boston dataset from the MASS library.

The task is to fit all the models that were developed in the chapter. I generated a random sample of 100 data points for testing and left 406 in the training set.

The first thing I learned is that in the presence of some outliers the test-set performance of the models can be hugely variable. For the exact same model, depending on the test set, I could get an MSE of 100 or 10. The effect depended on whether some outliers got into the test set or not – of course an outlier in the test set meant that it had no influence on the model but generated a large residual.

So, comparing the methods – again the simple regression (with interactions) performed on average better than either the lasso or the ridge regression. PCR was somewhere in between the regression and the lasso, while PLS got very close to the simple regression. Given how much more difficult it would be to explain a PLS model compared to the regression, the simple regression still seems to be the winner – but the number of variables was really not high enough to see the advantages of the more sophisticated methods.

Another point – it does make sense to include nonlinearities and interactions in the models. This would be easy with a simple regression; for all the others I just added product columns to the dataset (could try squares as well). The relative ordering of the models did not change, but the MSEs went down for all of them.

Also, the outliers complicate the modelling a lot – so exploratory analysis would be a must for any modelling. This does not seem to be a great discovery, but one tends to forget it in the heat of a project.
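For reference, roughly how the comparison can be set up (a sketch under my own choices of split and component count, not the book's code; the ncomp = 8 below is just a placeholder, the CV plots should guide the actual choice):

library(MASS)    # Boston data
library(glmnet)  # ridge / lasso
library(pls)     # pcr(), plsr()

set.seed(1)
test.idx <- sample(nrow(Boston), 100)
train <- Boston[-test.idx, ]
test  <- Boston[test.idx, ]

mse <- function(pred, obs) mean((pred - obs)^2)

# Plain least squares (interactions could be added with crim ~ .^2)
lm.fit <- lm(crim ~ ., data = train)
mse(predict(lm.fit, test), test$crim)

# Ridge (alpha = 0) and lasso (alpha = 1) via glmnet, lambda chosen by CV
x.tr <- model.matrix(crim ~ ., train)[, -1]
x.te <- model.matrix(crim ~ ., test)[, -1]
ridge <- cv.glmnet(x.tr, train$crim, alpha = 0)
lasso <- cv.glmnet(x.tr, train$crim, alpha = 1)
mse(predict(ridge, newx = x.te, s = "lambda.min"), test$crim)
mse(predict(lasso, newx = x.te, s = "lambda.min"), test$crim)

# PCR and PLS, number of components to be chosen from the CV output
pcr.fit <- pcr(crim ~ ., data = train, scale = TRUE, validation = "CV")
pls.fit <- plsr(crim ~ ., data = train, scale = TRUE, validation = "CV")
mse(predict(pcr.fit, newdata = test, ncomp = 8), test$crim)
mse(predict(pls.fit, newdata = test, ncomp = 8), test$crim)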


So, on to chapter 7...