[TSBC] The R Book, chapter 10, regressions.

#1
Hi Fellow TS'rs,

I am very pleased to welcome you all to our first book club session and our first book: Michael Crawley's The R Book. We will use the forum as our means of discussion, since this medium is immune to timezone problems.

We will start with chapter 10, as a gentle introduction into the other chapters.

The rules.

Discussion will start within this thread now. You may post remarks, suggest improvements on Crawley, point out problems you are having, or ask deeper philosophical questions, but please keep on topic and don't let your posts trail off. The discussion will continue until Monday the 11th of July ends on the international date line (note this deadline is after the weekend, which is nice), after which the discussion thread will be closed.

Happy posting everyone!

Note: if you want to discuss things such as the book club rules, or to suggest other books, use this thread. The thread below is for the chapter only.
 

bryangoodrich

Probably A Mammal
#2
I may just be tired (and I haven't tested their code to get familiar with it yet), but on the Bootstrapping section (p 418-21), I am curious about the variables he calls. For instance, in his statistic function,

Code:
reg.boot <- function(regdat, index) {
  xv <- explanatory[index]
  yv <- response[index]
  model <- lm(yv ~ xv)
  coef(model)
}
Where are "explanatory" and "response" coming from? I know he is passing 'regdat' into the 'boot' function, but the two variables of interest don't appear to me to be in the scope of that statistic function. Am I right on this or am I missing something? If they were part of the regdat dataframe, there is no 'with' or 'attach' of it to make the call to those variables possible. It seems to me he is calling them from the global environment as in the previous example (i.e., bootstrapping manually w/o the 'boot' function). That would be horrible programming practice if I am right.
 

bryangoodrich

Probably A Mammal
#3
Here's another question to get us started: what is the benefit (if there is one) of using a piecewise regression (p 425-30) over, say, a polynomial regression? When I looked at the output I thought using a higher-order term to introduce curvature might be another alternative. I don't think I've ever learned anything about piecewise regression (though I have heard of it). Crawley doesn't offer any discussion on this point. Maybe it has something to do with the data being replicates at the specified values of 'Area'? I don't recall that being a problem for a polynomial fit, though. What do you more knowledgeable sorts have to say on the matter? Any references to further discussion of piecewise regression?
 

bugman

Super Moderator
#4
That's a good question bryangoodrich.

My understanding (though it is probably more complicated than my answer) is that breakpoint or piecewise regression might be better suited to interval-type data along a continuous scale, where there are different linear slopes within those intervals. This then allows predictions within the different intervals based on the different slopes. It can be particularly useful in biological studies where allometry, feeding behaviour or growth rates are analysed, i.e. you might see different feeding rates at different size scales and be interested in isolating the specific points at which those rates change.

I do agree with your comments relating to the lack of discussion on these points.

I will also add that up until the piecewise regression section, I was doing fine. However, my biggest gripe was his statement on pg. 428, when he says "We have intentionally created a singularity in the piecewise regression between Area=100 and Area=1000 (the aliased parameters show up as NAs)". OK, Crawley. So what? Why did you do this? What does it even mean? Previously in the chapter he attempted to explain things in layman's terms. I would have liked to see some kind of context provided here, or at least an explanation for those who can't figure out why this was done. Or a referral to a glossary.

P.

Edit: I will add that I do like his explanations of how to interpret the diagnostic plots (pg 401 & 402). Basic, I know; but I thought they were nicely explained.
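For anyone who wants to play with those, a quick toy example of the standard plot(model) output (using the built-in cars data, nothing to do with the chapter's examples):

Code:
model <- lm(dist ~ speed, data = cars)  # any fitted lm will do
par(mfrow = c(2, 2))                    # 2 x 2 layout for the four panels
plot(model)                             # residuals vs fitted, normal QQ, scale-location, residuals vs leverage
par(mfrow = c(1, 1))                    # restore the default layout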
 

Dason

Ambassador to the humans
#5
Oh dear. Less than a week left. I should get reading.

Also to add to the piecewise v. polynomial regression discussion - why should we use polynomial regression? Do you really know the functional form of the underlying expectation? I'm guessing not. So what do we gain by choosing a polynomial over a piecewise regression? I haven't read the section yet so I don't know what type of data they're working with (are there repeated x values or is it all scattered... I don't know). But if we have a couple of x values that have multiple y values then either way the regression is going to contain the mean of those groups. So if we choose a polynomial all we're doing is adding the assumption that the underlying mechanism is smooth and differentiable. Is there a reason we should believe this? If so then sure we could model it as such. If there isn't a reason then why not just use a piecewise regression and admit that we don't really know as much as we would like and are just trying to get a good local approximation.
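To make the comparison concrete, here's a minimal sketch on made-up data (not Crawley's example, and with the break point assumed known), contrasting a piecewise fit with a quadratic one:

Code:
set.seed(1)
x   <- 1:100
y   <- ifelse(x < 50, 2 * x, 100 + 0.2 * (x - 50)) + rnorm(100, sd = 5)
brk <- 50                                # assumed (known) break point

piecewise  <- lm(y ~ x * I(x < brk))     # separate slope on each side of the break
polynomial <- lm(y ~ poly(x, 2))         # quadratic alternative

AIC(piecewise, polynomial)               # one crude way to compare the two fits
The piecewise model only commits to "two straight lines meeting somewhere", while the polynomial commits to a globally smooth, differentiable curve, which is exactly the assumption in question.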
 

Dason

Ambassador to the humans
#7
Ha. Well I had other things going on the first two days the book club started and then just forgot about it. My bad.

... and Team Fortress 2 became free to play recently... hasn't helped with my productivity.
 

bryangoodrich

Probably A Mammal
#8
You make a good point, Dason. A similar thing can be said about bugman's reply, too. For us to know the segmentation of the data, we'd have to already know something about the data and the process that produced it. If we didn't know the segmentation of the intervals, then we wouldn't know where to break it up or if it were appropriate. To me, I looked at it and thought a more continuous line would fit well, also. But as you point out, that also presupposes something about the data and the process that produced it (i.e., it's smooth and differentiable). In either case, we could still explore the sorts of regressions and their statistical justification, but I think this discussion goes to show that there is always something to be gained by looking beyond the data and into the reality about its production. A data analysis is always improved by having knowledge beyond the data--knowledge about the data.

Thank you both for your excellent explanations.
 

bryangoodrich

Probably A Mammal
#12
It's only a slight diversion, but this is a nice discussion of break-point regression for those who want to look a bit further.

Muggeo, V. M. R. (2003). Estimating regression models with unknown break-points. Statistics in Medicine, 22:3055–3071. DOI: 10.1002/sim.1545

http://onlinelibrary.wiley.com/doi/10.1002/sim.1545/abstract
 

trinker

ggplot2orBust
#13
As I said before, I'm the one here with the least amount of stats knowledge, so be gentle with me ;)

Anyway...

In connection to the discussion of the piecewise regression...Dason you said:
Also to add to the piecewise v. polynomial regression discussion - why should we use polynomial regression? Do you really know the functional form of the underlying expectation? I'm guessing not. So what do we gain by choosing a polynomial over a piecewise regression?
As I'm attempting this piecewise regression (a technique I'm not at all familiar with), it feels like I'm overfitting the data (this is just gut instinct, without looking into it further). A polynomial regression applies one global form to all of the data, whereas the piecewise regression chops the data up wherever it seems convenient. It seems to me that you'd run into the same problems you run into with a histogram and its breaks. I'm trying to wrap my brain around it (just on what The R Book has given) and it feels like the predictive power of the model is worthless.


I guess if the data looked like this (with one break), where a polynomial obviously does not fit:
Code:
y <- c((40:20) + sample(-3:3, 21, replace = TRUE),
       (20:40) + sample(-3:3, 21, replace = TRUE))
x <- 100:141
plot(x, y)
...it may be useful to use a piecewise regression, but in the case Crawley provides it seems wrong.

Just some rambling thoughts. Now I'll go read up on it and see if my attitude changes.
 

trinker

ggplot2orBust
#14
I quasi-get it now. Crawley uses fewer breaks than I thought. I misunderstood the plot on p. 427; I thought that was his final "regression line".

He even rejects model3 because it is not a significant improvement over model2. I guess this would guard against the dangers I was worried about before.
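For reference, the kind of comparison being described is a nested-model test, something along these lines (model names as in the chapter, so this is only a sketch):

Code:
anova(model2, model3)   # F-test: keep model2 if the extra break is not a significant improvement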
 

trinker

ggplot2orBust
#15
Is it just me, or do the plots, i.e. plot(model), not look particularly good for the piecewise regression (the model3 that was selected as the best fit)? The QQ plot doesn't look great. Can we still use the standard plot(model) and have the typical expectations of the graphs?
 

trinker

ggplot2orBust
#16
On page 435 Crawley uses non-parametric smoothers to create a model, which he then plots. He goes on to say:

The confidence intervals are sufficiently narrow to suggest that the curvature in the relationship between ozone and temp is real, but the curvature of the relationship with wind is questionable, and a linear model may well be all that is required for solar radiation.
I get the radiation comment, but I see very little difference between the wind and temp plots. Could anyone shed some light on how to interpret them in a way that lets Crawley accept the curvature of one and reject the other?
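In case it helps to reproduce those plots, here is a rough sketch of the kind of additive model that section is fitting (I'm assuming the chapter's ozone.data with columns ozone, rad, temp and wind, and this may not be Crawley's exact call):

Code:
library(mgcv)   # one common smoother package
model <- gam(ozone ~ s(rad) + s(temp) + s(wind), data = ozone.data)
par(mfrow = c(1, 3))
plot(model, se = TRUE)   # the bands around each smooth are pointwise confidence intervals
par(mfrow = c(1, 1))
The usual rule of thumb is that if a straight line can be drawn entirely inside the confidence band of a fitted smooth, the apparent curvature is questionable; that seems to be roughly the judgement being made for wind versus temp.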
 
#17
bryangoodrich said:
I may just be tired (and I haven't tested their code to get familiar with it yet), but on the Bootstrapping section (p 418-21), I am curious about the variables he calls. For instance, in his statistic function,

Code:
reg.boot <- function(regdat, index) {
  xv <- explanatory[index]
  yv <- response[index]
  model <- lm(yv ~ xv)
  coef(model)
}
Where are "explanatory" and "response" coming from? I know he is passing 'regdat' into the 'boot' function, but the two variables of interest don't appear to me to be in the scope of that statistic function. Am I right on this or am I missing something? If they were part of the regdat dataframe, there is no 'with' or 'attach' of it to make the call to those variables possible. It seems to me he is calling them from the global environment as in the previous example (i.e., bootstrapping manually w/o the 'boot' function). That would be horrible programming practice if I am right.
You are completely correct. That is sloppy coding and it only works because before (on page 418) he uses 'attach' on regdat.
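For what it's worth, a version that keeps everything inside the statistic function might look like this (just a sketch, assuming regdat has columns named response and explanatory; this is not Crawley's code):

Code:
library(boot)

reg.boot <- function(regdat, index) {
  d <- regdat[index, ]                         # the resampled rows handed over by boot()
  coef(lm(response ~ explanatory, data = d))   # no attach(), no reliance on globals
}

# usage sketch:
# reg.model <- boot(regdat, reg.boot, R = 999)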

I don't know about you guys, but I don't like the statement he makes on page 389. I guess it's just semantics, but "we want to find the values of the slope and intercept that make the data most likely".

Is it just me or is that weird? As an empiricist I would like to find the most likely model given the data... the data is our closest link to reality.
To me, stating it that way is like a judge weighing the likelihood of the crime given the suspect: judgment gets passed on whether the crime took place at all, even though the crime, like the data, is a fact.
Anyway, it's a small point.

I'm halfway through the chapter; I didn't have time earlier because it's the time of year with important grant deadlines. Up to now I think it has been very useful for refreshing some of my knowledge.

bugman said:
Edit: I will add that I do like his explanations of how to interpret the diagnostic plots (pg 401 & 402). Basic, I know; but I thought they were nicely explained.
Yep, I also found it useful, though a bit sparse.
 

Dason

Ambassador to the humans
#18
You are completely correct. That is sloppy coding and it only works because before (on page 418) he uses 'attach' on regdat.

I don't know about you guys, but I don't like the statement he makes on page 389. I guess it's just semantics, but "we want to find the values of the slope and intercept that make the data most likely".

Is it just me or is that weird? As an empiricist I would like to find the most likely model given the data... the data is our closest link to reality.
To me, stating it that way is like a judge weighing the likelihood of the crime given the suspect: judgment gets passed on whether the crime took place at all, even though the crime, like the data, is a fact.
Anyway, it's a small point.
I'll admit I've been a bad book club member for this chapter - I've only really read sections that people comment on. I guess I don't quite see the problem with his statement. In my eyes it's just another way of saying "Let's do maximum likelihood estimation".
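To make that concrete: for normal errors the least-squares fit from lm() is also the maximum-likelihood fit, so "the values that make the data most likely" is just the usual ML phrasing rather than a claim about the probability of the model given the data. A small check on built-in data (nothing to do with the chapter):

Code:
fit <- lm(dist ~ speed, data = cars)
logLik(fit)                                  # log-likelihood at the lm (= ML) estimates

# the same number computed by hand from the normal density at those estimates:
sigma.hat <- sqrt(mean(residuals(fit)^2))    # ML estimate of sigma (divides by n)
sum(dnorm(cars$dist, mean = fitted(fit), sd = sigma.hat, log = TRUE))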
 

bugman

Super Moderator
#19
I don't know about you guys, but I don't like the statement he makes on page 389. I guess it's just semantics, but "we want to find the values of the slope and intercept that make the data most likely".

Is it just me or is that weird? As an empiricist I would like to find the most likely model given the data... the data is our closest link to reality.


A small point, but a good one. I missed that.
 

bugman

Super Moderator
#20
I will also add that up until the piecewise regression section, I was doing fine. However, my biggest gripe was his statement on pg. 428, when he says "We have intentionally created a singularity in the piecewise regression between Area=100 and Area=1000 (the aliased parameters show up as NAs)". OK, Crawley. So what? Why did you do this? What does it even mean? Previously in the chapter he attempted to explain things in layman's terms. I would have liked to see some kind of context provided here, or at least an explanation for those who can't figure out why this was done. Or a referral to a glossary.
So, I was still wondering about this. I may have figured it out, but can someone butt in and correct me or elaborate...

I think it was an aliasing technique to remove some degrees of freedom and therefore prevent the model from becoming over-parameterised? Am I close? (No cigar though, right?)
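Not sure about the "why", but on what an aliased parameter is: when a term in the model is perfectly collinear with terms already fitted, lm() cannot estimate it and reports its coefficient as NA. A toy illustration (nothing to do with the Area example):

Code:
set.seed(1)
x1 <- 1:10
x2 <- 2 * x1              # x2 carries no information beyond x1
y  <- x1 + rnorm(10)
coef(lm(y ~ x1 + x2))     # the x2 coefficient is aliased and comes back as NA
So the NAs flag parameters that cannot be estimated from the data as coded; why Crawley builds one in deliberately I'm honestly not sure either.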
 