# Thread: Interpreting residuals

1. ## Interpreting residuals

Still getting used to interpreting large numbers of data points for regression assumptions.

The attached looks fan shaped to me, which is said to violate the assumption of equal variance. But I thought I would ask for a second opinion.

A QQ plot clearly shows non-normality in my data; it's highly skewed. But with 30,000 data points I don't think non-normality will have much impact on the p-values...

2. ## Re: Interpreting residuals

Is your response a count? Or possibly a non-negative random variable?

3. ## The Following User Says Thank You to Dason For This Useful Post:

noetsi (06-09-2016)

4. ## Re: Interpreting residuals

A heat map approach (hex binning) would show you where the points are more dense. If the variance is uniform, the color should be roughly constant along the horizontal. That's a good way to view the distribution within a dense set of points (overplotting).

See http://www.r-bloggers.com/5-ways-to-...stograms-in-r/
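In Python the same idea looks like this (a rough sketch using matplotlib's `hexbin` on simulated fan-shaped residuals, since the original data isn't available; the R links above do the equivalent):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated fitted values and fan-shaped residuals:
# the error spread grows with the fitted value.
fitted = rng.uniform(0, 100, size=30_000)
resid = rng.normal(0.0, 0.1 * fitted)

fig, ax = plt.subplots()
hb = ax.hexbin(fitted, resid, gridsize=40, cmap="viridis")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
fig.colorbar(hb, label="points per hexagon")
fig.savefig("residual_hexbin.png")
```

With equal variance the colored band has roughly constant vertical spread across the horizontal axis; a fan shape shows up as a visibly widening band.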

In any case, the only thing that matters is: https://en.wikipedia.org/wiki/Hetero...y#Consequences

The coefficients remain unbiased, but you might rule something not statistically significant when it actually is, because the standard errors are incorrect. That may or may not be a significant issue in your final model (e.g., Y ~ A + B + C may require C by fiat; if C comes out not significant, possibly due to heteroscedasticity, I'm still going to keep it in).
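As a minimal numpy sketch of that point (made-up data; the coefficients stay unbiased while the classical and White HC0 standard errors diverge):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30_000

# Made-up heteroscedastic data: the noise spread grows with x.
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                  # OLS coefficients: still unbiased
resid = y - X @ beta

# Classical SEs assume one common error variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) sandwich estimator: lets each point carry its own variance.
meat = X.T @ (X * resid[:, None] ** 2)
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", beta)
print("classical SE:", se_classical)
print("White SE:    ", se_white)
```

Here the slope estimate lands near the true 0.5 either way; only the standard errors (and hence the p-values) change.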

5. ## The Following User Says Thank You to bryangoodrich For This Useful Post:

noetsi (06-09-2016)

6. ## Re: Interpreting residuals

Dason, it's cycle time: how many days you were in a given status. I guess that's a count, but it has a very wide range of possible values, thousands of them.

I understand, bryan, that it will not bias the results. However, I am deciding whether to leave something in the model or not, and two interesting variables have extremely high p-values this way, yet their effect sizes are substantively important in my judgment. I ran White's SE (although they rarely change the results with my data, and they did not here). I have no theory here, which is common in my analysis because none exists, but logically the variables that would be excluded might influence the results.

I am guessing you both think it is not homoscedastic....

7. ## Re: Interpreting residuals

OK, two points from the link bryan posted.

One author wrote, "unequal error variance is worth correcting only when the problem is severe."
What is severe? For example, to me the data I posted shows unequal error variance. Is it severe? Nothing I read discusses when it is or is not.

I did run White's heteroscedasticity-consistent standard errors and, as nearly always with my data, they did not change the SE very much. But one thing I have never come across in reading about White's SE is how you can know for sure whether they corrected the problem. Or can you assume they nearly always will....
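One way to put a number on "severe" is the Breusch-Pagan test, which is easy to compute by hand (a rough numpy sketch on simulated data; note that with 30,000 points even mild heteroscedasticity will reject, so the auxiliary R-squared is arguably more informative than the p-value):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 30_000
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)   # error variance grows with x

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Auxiliary regression: squared residuals on the same regressors.
u = resid ** 2
g, *_ = np.linalg.lstsq(X, u, rcond=None)
u_hat = X @ g
r2 = 1.0 - np.sum((u - u_hat) ** 2) / np.sum((u - u.mean()) ** 2)

lm = n * r2                                     # Breusch-Pagan LM statistic
# chi-square(1) survival function, since df = regressors - 1 = 1 here
p = math.erfc(math.sqrt(lm / 2.0))
print(f"auxiliary R^2 = {r2:.3f}, LM = {lm:.1f}, p = {p:.2g}")
```

A large auxiliary R-squared means the regressors explain a lot of the variation in the squared residuals, i.e., the unequal variance is substantial, not just statistically detectable.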

8. ## Re: Interpreting residuals

The problem with trying to define severe is that it isn't an issue with the model. It's an issue with the objective the model is being used to serve. If I'm trying to predict an outcome within a certain margin of error and it can be shown that this unequal error variance results in wildly fluctuating predictions outside the margin I'm willing to tolerate, then it's not an issue with the model. It's an issue with my expectations of using the model to serve this application. In some applications it might be all right, but in others it may not.

The problem with testing severity with the model itself is that you've already fit the data. This is why cross-validation methods aim to see how the model performs with new data. You have a ton of data: break it into 10 groups, fit to 9 of those groups, and see how poorly the fit estimates the 10th. Repeat so that you do a prediction for each group. The average of those 10 prediction errors is a good estimate of how well your model fits new data (ceteris paribus). While each model may have some unequal variance, if it doesn't lead to poor out-of-sample predictions according to your judgment, then it's a win (and a benchmark against which you can test other models trying to do the same thing).
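The 10-fold procedure above is only a few lines in, say, Python (a sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30_000
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)
X = np.column_stack([np.ones(n), x])

k = 10
folds = np.array_split(rng.permutation(n), k)  # random fold assignment

errors = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    pred = X[test] @ beta                      # predict the held-out fold
    errors.append(np.sqrt(np.mean((y[test] - pred) ** 2)))

cv_rmse = np.mean(errors)
print(f"10-fold CV RMSE: {cv_rmse:.3f}")
```

The resulting CV RMSE is the benchmark: compare it against the margin of error the application can tolerate, or against the same number from a competing model.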

9. ## The Following User Says Thank You to bryangoodrich For This Useful Post:

noetsi (06-09-2016)

10. ## Re: Interpreting residuals

That seems useful advice. One thing I have never understood is how (in actual software) you predict the values of the hold-out data from the estimated model and determine the error. (With time series there is a simple process, but then you are predicting few points.) I know how this works in principle, and I can do it with small data sets essentially manually, but I don't know how to do it with thousands of data points.
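Mechanically the prediction step is the same no matter how many points are held out: you apply the fitted coefficients to the hold-out design matrix in one operation. A sketch in Python on made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30_000
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])

# 80/20 split: fit on the training part, score the hold-out part.
idx = rng.permutation(n)
train, hold = idx[:24_000], idx[24_000:]

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
pred = X[hold] @ beta        # one matrix product predicts all hold-out points
rmse = np.sqrt(np.mean((y[hold] - pred) ** 2))
print(f"hold-out RMSE on {hold.size} points: {rmse:.3f}")
```

Statistical software does the same thing under the hood (e.g., a predict/score step applied to a data set the model was not fit on), so 6,000 points is no harder than 6.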

It looks to me like White's is the generally accepted solution in the literature I looked at, but I found no simulations showing how likely White's correction is to be right or wrong. Commonly they bring up the fact that heteroscedasticity can be driven by a misspecified model and that you should specify the model correctly. Which always brings me back to the question I have had since my first regression course decades ago: in social science your model is always going to be misspecified, because reality is complex and we know too little about what we model (in my area there appears to be little empirical theory; its stress is on social interaction, not data).

So how do you fix something that is certainly, and unavoidably, wrong?

11. ## Re: Interpreting residuals

All models are wrong, but some are useful.

You can do k-fold cross-validation on time series data, albeit with a little creativity. In ordinary cross-validation the folds (groups) are just a random assignment of the data, but time series requires that each group maintain the time series structure inherent in the data. Thus, you do a sort of stratified k-fold cross-validation. Alternatively, you can do variations of feed-forward leave-one-out CV (LOOCV) or repeated LOOCV.

http://robjhyndman.com/hyndsight/tscvexample/
http://robjhyndman.com/hyndsight/crossvalidation/
https://en.wikipedia.org/wiki/Cross-...on_(statistics)
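The forward-chaining idea from those links can be sketched like this (numpy only, on a simulated AR(1) series: an expanding window where each model is fit only on the data observed so far and judged on its one-step-ahead forecast):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 500

# Simulated AR(1) series: y_t = 0.7 * y_{t-1} + noise.
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + rng.normal(0.0, 1.0)

errors = []
for origin in range(100, T - 1):
    past = y[: origin + 1]                    # data observed so far only
    X = np.column_stack([np.ones(origin), past[:-1]])
    beta, *_ = np.linalg.lstsq(X, past[1:], rcond=None)
    forecast = beta[0] + beta[1] * past[-1]   # one-step-ahead forecast
    errors.append((y[origin + 1] - forecast) ** 2)

rmse = float(np.sqrt(np.mean(errors)))
print(f"one-step-ahead CV RMSE: {rmse:.3f}")
```

Because the innovation standard deviation is 1.0 here, the out-of-sample RMSE should come out close to 1; a well-specified model can't beat the irreducible noise.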

12. ## The Following User Says Thank You to bryangoodrich For This Useful Post:

noetsi (06-20-2016)

14. ## The Following User Says Thank You to hlsmith For This Useful Post:

noetsi (06-20-2016)

15. ## Re: Interpreting residuals

Similar figure with transparency and histograms:

http://analytics.ncsu.edu/sesug/2011/RV08.Watts.pdf
