
Thread: General guidance on interpreting residuals and diagnostics for linear regressions

  1. #1

    General guidance on interpreting residuals and diagnostics for linear regressions




    Now that I'm more comfortable with regression (in SAS) and understand the assumptions needed for linear regression, I'm able to run simple models of my data. However, my problem is that I do not know what to do when my model does not meet the assumptions. I've tried to look at various SAS sources, but everything I've seen only talks about why a model fails a certain assumption (i.e., because the residuals aren't random), and doesn't really talk about how to correct it (besides a transformation).

    Is there any guidance on the 'order of operations' when interpreting regression output?

    For any given simple model with a single independent and dependent variable, I look at the R^2, the F statistic, the p-value of each independent variable, and then the residual plots and the normal QQ plot. There's a lot to look at and I don't know where to start:

    I.e., for a low R squared, does this automatically mean I should add another variable? Or should I add another variable only if the R squared does not bump up after a transformation?

    The F statistic is just the overall fit of the model, so once I get everything else corrected, it should move up, right?

    For the p-value of each variable and the intercept, how do I know whether a high p-value is telling me to 1) add another variable, 2) do a transformation, or 3) that it's a bad variable?

    And for the different plots, does a certain shape indicate a certain solution? And what if I find the best transformation but it's still not correct (i.e., the residuals are still all clumped at the bottom), or the residuals look better but the R squared and p-values are still not adequate?

  2. #2
    Fortran must die
    noetsi's Avatar

    Re: General guidance on interpreting residuals and diagnostics for linear regressions

    I have documents that run to tens of typed pages dealing with this topic... it's anything but simple.

    Some starting points http://www.listendata.com/2015/03/ch...-multiple.html

    http://people.duke.edu/~rnau/testing.htm

    For heteroscedasticity, the simplest check is just looking at the residuals to see if there is a pattern in the spread, e.g., getting bigger or smaller...
    For non-linearity, look at partial regression plots and see if there is any obvious non-linearity.
    For normality, run a QQ plot of the residuals against a normal distribution.
    For multicollinearity, run tolerance checks.

    I am not aware of any formal test for non-independence except for that which occurs with autocorrelation. You can test for that with tests such as Durbin-Watson, or you can look at the ACF in ARIMA (the former is much easier).
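    For what it's worth, a couple of these checks (Durbin-Watson for first-order autocorrelation, tolerance for multicollinearity) are simple enough to compute by hand. Here is a minimal numpy sketch on simulated data, just to show what the numbers mean — the thread is about SAS, where PROC REG options such as DW, TOL, and VIF give you the same diagnostics:

    ```python
    import numpy as np

    # Simulated data: x2 is deliberately correlated with x1
    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
    y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(size=n)

    # Fit OLS and get residuals
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Durbin-Watson: sum of squared successive differences over sum of squares
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    # Tolerance for x1: 1 - R^2 from regressing x1 on the other predictors
    Z = np.column_stack([np.ones(n), x2])
    g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
    r2_aux = 1 - np.sum((x1 - Z @ g) ** 2) / np.sum((x1 - x1.mean()) ** 2)
    tolerance = 1 - r2_aux

    print(round(dw, 2), round(tolerance, 2))
    ```

    A Durbin-Watson value near 2 suggests little first-order autocorrelation; a tolerance well below 0.1 (equivalently, VIF above 10) is the usual red flag for multicollinearity.
    
    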

    I have a tome on how to do this in SAS, but no way to send it, unfortunately. Note that model fit/value, which is what you seem to be talking about, is different than the tests of the assumptions I mention above.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  3. #3

    Re: General guidance on interpreting residuals and diagnostics for linear regressions

    Thanks Noetsi. That document helps and looks really good.

    So to break it down in somewhat simple terms: when working with regression, at the end of the day, I really have only 2 options when correcting, right? 1) transformations, and 2) adding/subtracting/switching out independent variables.

    Does the way my diagnostics and results come out indicate which I should be doing (should I be transforming vs. adding/subtracting variables)?
    E.g., if my residuals show a pattern vs. if my residuals are all clumped up to one side.
    Or is it just trial and error?

    I guess my issue is that when I run a regression and I see all the heteroscedasticity issues, low R squared values, etc., I know it's a bad fit and I know what it is *supposed* to look like, but I don't know which method will fix it.

  4. #4
    TS Contributor
    CowboyBear's Avatar
    Location
    New Zealand

    Re: General guidance on interpreting residuals and diagnostics for linear regressions

    Quote Originally Posted by semidevil View Post
    So to break it down in somewhat simple terms: when working with regression, at the end of the day, I really have only 2 options when correcting, right? 1) transformations, and 2) adding/subtracting/switching out independent variables.
    You can also choose methods that are more robust to particular distributional assumptions - e.g., bootstrapping to avoid relying on the normality assumption for trustworthy confidence intervals or p-values.
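    The bootstrap idea can be sketched quickly with a pairs (case-resampling) bootstrap, which resamples whole rows and refits, so it doesn't lean on normal or even homoscedastic errors. Simulated data with heavy-tailed noise, numpy only, purely as an illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)  # heavy-tailed errors

    def slope(x, y):
        """OLS slope of y on x (with intercept)."""
        X = np.column_stack([np.ones(len(x)), x])
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    # Pairs bootstrap: resample rows with replacement, refit, take percentiles
    boot = np.empty(2000)
    for b in range(boot.size):
        idx = rng.integers(0, n, size=n)
        boot[b] = slope(x[idx], y[idx])

    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
    print(round(ci_low, 2), round(ci_high, 2))
    ```

    The percentile interval brackets the OLS point estimate without any normality assumption; with 100 rows and 2000 resamples it takes a fraction of a second.
    
    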

    In terms of transformations and additions/subtractions of variables... this does kinda depend on what you're actually trying to achieve, but in general I'd say pick a model ahead of time that actually allows you to answer your research question, and then don't mess around with it. You want to avoid p-hacking and the garden of forking paths. If there are multiple suitable models that answer your question, report them all.
    Matt aka CB | twitter.com/matthewmatix

  5. #5
    TS Contributor
    Miner's Avatar
    Location
    Greater Milwaukee area

    Re: General guidance on interpreting residuals and diagnostics for linear regressions

    Quote Originally Posted by semidevil View Post
    I.e., for a low R squared, does this automatically mean I should add another variable?
    Sometimes. This might be a lurking variable for which you have no data. Often, it is due to variation in the measurement itself (i.e., repeatability and reproducibility).

  6. #6
    Fortran must die
    noetsi's Avatar

    Re: General guidance on interpreting residuals and diagnostics for linear regressions


    Quote Originally Posted by semidevil View Post
    Thanks Noetsi. That document helps and looks really good.

    So to break it down in somewhat simple terms: when working with regression, at the end of the day, I really have only 2 options when correcting, right? 1) transformations, and 2) adding/subtracting/switching out independent variables.

    Does the way my diagnostics and results come out indicate which I should be doing (should I be transforming vs. adding/subtracting variables)?
    E.g., if my residuals show a pattern vs. if my residuals are all clumped up to one side.
    Or is it just trial and error?

    I guess my issue is that when I run a regression and I see all the heteroscedasticity issues, low R squared values, etc., I know it's a bad fit and I know what it is *supposed* to look like, but I don't know which method will fix it.
    It depends on what your issue is. For example, in dealing with heteroscedasticity I most commonly rely on White's robust standard errors. A common solution for non-linearity is to specify a quadratic or sometimes a cubic term. Your diagnostics tell you what the problem is; you have to consider the literature when deciding how to address it.
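    For the curious, White's (HC0) robust standard errors are just the "sandwich" estimator, which is easy to write down directly. A numpy sketch on simulated heteroscedastic data (in SAS, I believe the equivalent comes from the heteroscedasticity-consistent covariance options in PROC REG):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    x = rng.uniform(0, 3, size=n)
    # Heteroscedastic errors: the spread grows with x
    y = 1.0 + 2.0 * x + rng.normal(scale=0.2 + x**2, size=n)

    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical SEs assume a single error variance for every observation
    s2 = resid @ resid / (n - X.shape[1])
    se_classical = np.sqrt(np.diag(s2 * XtX_inv))

    # White / HC0 sandwich: (X'X)^-1 (sum_i e_i^2 x_i x_i') (X'X)^-1
    meat = X.T @ (X * resid[:, None] ** 2)
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

    print(np.round(se_classical, 4), np.round(se_robust, 4))
    ```

    With the error spread growing in x, the robust slope SE comes out larger than the classical one — the usual sign that the classical SEs were overconfident.
    
    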

    If I can figure out how to send the document I have on this, I will. It is very long, years' worth of work, but it will probably help. But solutions are judgmental to some extent.

    I don't rely on R squared at all. For some analyses you can have a low R squared and still have done everything correctly; that's because the phenomenon in question is so complex that you are never going to get a high R squared. And what counts as high varies with the type of analysis. Heteroscedasticity and extreme outliers are often indications of a variable you left out, and those are almost certainly better indicators than R squared. Partial regression plots are useful if you can identify the variable you left out.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
