# Thread: General guidance on interpreting residuals and diagnostics for linear regressions

1. ## General guidance on interpreting residuals and diagnostics for linear regressions

Now that I'm more comfortable with regression (in SAS) and understand the assumptions needed for linear regression, I'm able to run simple models of my data. However, my problem is that I do not know what to do when my model does not fit the assumptions. I've tried to look at various SAS sources, but everything I've seen only talks about why a model doesn't satisfy a certain assumption (i.e., because the residuals aren't random), and doesn't really talk about how to correct it (besides a transformation).

Is there any guidance on the 'order of operations' when interpreting regression output?

For any given simple model with a single independent and dependent variable, I look at the R^2, the F statistic, the p-value of each independent variable, and then the residual plots and the normal QQ plot. There's a lot to look at and I don't know where to start:

E.g.: for a low R squared, does this automatically mean I should add another variable? Or should I add another variable only if the R squared does not bump up after a transformation?

The F statistic is just the overall fit of the model, so once I get everything else corrected, it should move up, right?

For the p-value of each variable and the intercept, how do I know whether a high p-value is telling me to 1) add another variable, 2) do a transformation, or 3) that it's a bad variable?

And for the different plots, does a certain shape indicate a certain method of solution? And what if I find the best transformation but it's still not correct (i.e., the residuals are still all clumped at the bottom), or the residuals look better but the R squared and p-values are still not adequate?

2. ## Re: General guidance on interpreting residuals and diagnostics for linear regressions

I have a document that runs tens of typed pages dealing with this topic... it's anything but simple.

Some starting points http://www.listendata.com/2015/03/ch...-multiple.html

http://people.duke.edu/~rnau/testing.htm

For heteroscedasticity, the simplest check is just looking at the residuals to see if there is a pattern in the spread, i.e., whether it gets bigger or smaller.
For non-linearity, look at partial regression plots and see if there is any obvious non-linearity.
For normality, run a QQ plot of the residuals against a normal distribution.
For multicollinearity, run tolerance checks.
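To make the first two checks concrete (this thread is SAS-focused, but the ideas are language-agnostic), here is a small Python sketch on invented data: a crude spread comparison on the residuals as a stand-in for eyeballing heteroscedasticity, and the QQ correlation coefficient as a stand-in for eyeballing the normal QQ plot. All numbers are synthetic, chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, well-behaved data: y is linear in x with normal errors
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.standard_normal(n)

# Fit simple OLS by least squares and take residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Heteroscedasticity: compare residual spread in the low vs. high half
# of the fitted values (a rough numeric stand-in for the residual plot)
low = resid[fitted < np.median(fitted)]
high = resid[fitted >= np.median(fitted)]
ratio = np.var(high) / np.var(low)
print("variance ratio (near 1 suggests constant spread):", round(ratio, 2))

# Normality: probplot's correlation coefficient; values near 1 mean the
# QQ plot of residuals vs. a normal distribution is close to a straight line
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print("QQ correlation:", round(r, 3))
```

With well-behaved data like this, the variance ratio hovers near 1 and the QQ correlation is very close to 1; a ratio far from 1 or a visibly bowed QQ line is what the visual checks above are looking for.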

I am not aware of any formal test for non-independence except for that which occurs with autocorrelation. You can test for that with tests such as Durbin-Watson, or you can look at the ACF in ARIMA (the former is much easier).
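For what it's worth, the Durbin-Watson statistic is easy enough to compute by hand. A minimal Python sketch on synthetic residuals (again, the data here is made up, not from any real model):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation; toward 0 suggests positive, toward 4 negative."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Independent (white-noise) residuals should give a statistic close to 2
rng = np.random.default_rng(0)
white_noise = rng.standard_normal(500)
dw = durbin_watson(white_noise)
print(round(dw, 2))
```

Most stats packages report this for you (SAS included), but seeing the formula makes it clear why 2 is the "no autocorrelation" benchmark: with independent residuals, successive differences are about twice as variable as the residuals themselves.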

I have a tome on how to do this in SAS, but no way to send it, unfortunately. Note that model fit/value, which is what you seem to be talking about, is different from the tests of the assumptions I describe above.

3. ## Re: General guidance on interpreting residuals and diagnostics for linear regressions

Thanks Noetsi. That document helps and looks really good.

So to break it down in somewhat simple terms: when working with regression, at the end of the day, I really have only 2 options when correcting, right? 1) transformations, and 2) adding/subtracting/switching out independent variables.

Does the way my diagnostics and results come out indicate which I should be doing (should I be transforming vs. adding/subtracting variables)?
E.g., if my residuals show a pattern vs. if my residuals are all clumped to one side.
Or is it just trial and error?

I guess my issue is that when I run a regression and I see all the heteroscedasticity issues, low R squared values, etc., I know it's a bad fit and I know what it is *supposed* to look like, but I don't know which method will fix it.

4. ## Re: General guidance on interpreting residuals and diagnostics for linear regressions

Originally Posted by semidevil
So to break it down in somewhat simple terms: when working with regression, at the end of the day, I really have only 2 options when correcting, right? 1) transformations, and 2) adding/subtracting/switching out independent variables.
You can also choose methods that are more robust to particular distributional assumptions - e.g., the bootstrap to avoid relying on the normality assumption for trustworthy confidence intervals or p-values.
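To illustrate the bootstrap idea: a minimal Python sketch of a case-resampling bootstrap confidence interval for a slope, using deliberately skewed (non-normal) errors. The data, sample size, and number of resamples are all invented for the example.

```python
import numpy as np

# Synthetic data with skewed errors (centered exponential), true slope 0.8
rng = np.random.default_rng(2)
n = 150
x = rng.uniform(0, 5, n)
y = 1.0 + 0.8 * x + (rng.exponential(1.0, n) - 1.0)

def fit_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Case resampling: refit the model on resampled (x, y) pairs
boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boots.append(fit_slope(x[idx], y[idx]))

# Percentile 95% CI from the bootstrap distribution of the slope
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"95% bootstrap CI for slope: [{lo:.2f}, {hi:.2f}]")
```

The point is that the interval comes from the empirical resampling distribution rather than from a normal-theory formula, so it doesn't lean on the normality of the errors.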

In terms of transformations and additions/subtractions of variables... this does kinda depend on what you're actually trying to achieve, but in general I'd say pick a model ahead of time that actually allows you to answer your research question, and then don't mess around with it. You want to avoid p-hacking and the garden of forking paths. If there are multiple suitable models to answer your question, report them all.

5. ## Re: General guidance on interpreting residuals and diagnostics for linear regressions

Originally Posted by semidevil
E.g.: for a low R squared, does this automatically mean I should add another variable?
Sometimes. It might be a lurking variable for which you have no data. Often, it is due to variation in the measurement itself (i.e., repeatability and reproducibility).

6. ## Re: General guidance on interpreting residuals and diagnostics for linear regressions

Originally Posted by semidevil
Thanks Noetsi. That document helps and looks really good.

So to break it down in somewhat simple terms: when working with regression, at the end of the day, I really have only 2 options when correcting, right? 1) transformations, and 2) adding/subtracting/switching out independent variables.

Does the way my diagnostics and results come out indicate which I should be doing (should I be transforming vs. adding/subtracting variables)?
E.g., if my residuals show a pattern vs. if my residuals are all clumped to one side.
Or is it just trial and error?

I guess my issue is that when I run a regression and I see all the heteroscedasticity issues, low R squared values, etc., I know it's a bad fit and I know what it is *supposed* to look like, but I don't know which method will fix it.
It depends on what your issue is. For example, in dealing with heteroscedasticity I most commonly rely on White's robust standard errors. A common solution for non-linearity is to specify a quadratic, or sometimes cubic, term. Your diagnostics tell you what the problem is; you have to consider the literature when deciding how to address it.
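To show what White's correction actually does, here is a rough Python sketch of the HC0 sandwich formula on synthetic data whose error spread grows with x. In practice you'd use your package's built-in robust-SE option rather than hand-rolling this; the numbers below are purely illustrative.

```python
import numpy as np

# Synthetic heteroscedastic data: error standard deviation grows with x
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(1, 10, n)
y = 2.0 + 0.5 * x + rng.standard_normal(n) * x

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Classical (homoscedastic) standard errors: s^2 * (X'X)^-1
s2 = e @ e / (n - 2)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# White's HC0 heteroscedasticity-consistent standard errors:
# the sandwich (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * (e ** 2)[:, None])
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", np.round(se_classical, 3))
print("White SE:   ", np.round(se_white, 3))
```

With spread growing in x, the classical formula understates the slope's standard error and White's estimate comes out larger, which is exactly the situation where relying on the classical SEs would make your p-values too optimistic.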

If I can figure out how to send the document I have on this, I will. It is very long, years' worth of work, but it will probably help. Solutions are judgmental to some extent.

I don't rely on R squared at all. For some analyses you can have a low R squared and still have done everything correctly; it's because the phenomenon in question is so complex that you are never going to get a high R squared. And what counts as high varies with the type of analysis. Heteroscedasticity and extreme outliers are often indications of a variable you left out, and that is almost certainly a better signal than R squared. Partial regression plots are useful if you can identify the variable you left out.
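On partial regression (added-variable) plots: the construction is simple enough to sketch. Residual y on everything except the candidate variable, residual the candidate on that same set, and plot one against the other; the slope of that cloud equals the variable's coefficient in the full model. A Python sketch on made-up data (two deliberately correlated predictors):

```python
import numpy as np

# Synthetic data with correlated predictors; true coefficients 2.0 and 3.0
rng = np.random.default_rng(4)
n = 250
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + rng.standard_normal(n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.standard_normal(n)

def resid(v, X):
    b, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ b

# Added-variable construction for x2: residual both y and x2
# on everything else in the model (here, intercept + x1)
Z = np.column_stack([np.ones(n), x1])
ey = resid(y, Z)
ex2 = resid(x2, Z)

# The slope of ey on ex2 recovers x2's coefficient in the full model
partial_slope = (ex2 @ ey) / (ex2 @ ex2)
print(round(partial_slope, 2))
```

Plotting ey against ex2 is the partial regression plot itself: curvature in that cloud flags non-linearity in x2's contribution after the other variables are accounted for.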

