test of non-linearity for linear regression


Fortran must die
I continue to struggle with how to detect this.

This is the type of advice I have seen.
“How to diagnose: nonlinearity is usually most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values, which are a part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or around horizontal line in the latter plot, with a roughly constant variance.”

I am not sure what this means in practice. In the following plot, from an analysis I just ran, the points do seem to me to be roughly symmetric around the horizontal line, but I don’t think the variance is constant. Does that mean the relationship is nonlinear?
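One way to make the quoted advice concrete is to check the two things it mentions separately: the average residual across the range of fitted values (which speaks to linearity) and the spread of the residuals (which speaks to constant variance). The following is only a rough sketch on simulated data, not the analysis from this thread. The helper names `fit_line` and `binned_residuals`, the single predictor, and the choice of five bins are all assumptions made for illustration.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def binned_residuals(x, y, n_bins=5):
    """Sort points by fitted value, cut them into equal-sized bins, and
    return one (mean residual, residual SD) pair per bin.
    Any leftover points beyond n_bins * size are ignored."""
    a, b = fit_line(x, y)
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    size = len(pairs) // n_bins
    out = []
    for k in range(n_bins):
        chunk = [r for _, r in pairs[k * size:(k + 1) * size]]
        out.append((statistics.fmean(chunk), statistics.stdev(chunk)))
    return out

random.seed(0)
n = 4700  # same order of magnitude as the data set in the question
x = [random.uniform(0, 10) for _ in range(n)]

# Case 1 (simulated): straight-line mean, noise SD grows with x.
y_hetero = [1 + 2 * xi + random.gauss(0, 0.5 + 0.5 * xi) for xi in x]
# Case 2 (simulated): curved mean, constant noise SD.
y_curved = [1 + 2 * xi + 0.3 * (xi - 5) ** 2 + random.gauss(0, 1) for xi in x]

hetero = binned_residuals(x, y_hetero)
curved = binned_residuals(x, y_curved)

for label, bins in (("non-constant variance", hetero), ("curved mean", curved)):
    print(label)
    for mean_r, sd_r in bins:
        print(f"  mean residual {mean_r:6.2f}   SD {sd_r:5.2f}")
```

In the first case the bin means stay close to zero while the SD climbs from bin to bin. In the second the bin means swing from positive to negative and back. Only the second pattern is what the advice calls nonlinearity.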


I generated the partial residual plot (SAS calls this the partial regression residual plot), which some recommend for detecting nonlinearity, and I see no obvious pattern (although I have found few details on what to look for in these plots).


I have about 4700 data points, if that matters.
It looks pretty linear to me. The non-constant variance is a violation of the homoscedasticity assumption (there will come a time when I don't need to look up the spelling for that one), not of linearity. It is not important for determining the regression coefficients, only for when you want to predict individual points. "Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points." (Gelman).
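The claim about the coefficients can be checked with a small simulation. This is a minimal sketch under assumed settings: the true slope, the noise model, the sample size, and the helper name `ols_slope` are all made up for illustration. It refits a straight line to many data sets whose noise grows with x and looks at where the estimated slopes land. It only looks at the point estimate; the usual standard errors can still be off under non-constant variance, and this sketch does not address that.

```python
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx

random.seed(1)
TRUE_SLOPE = 2.0  # assumed value, chosen for the simulation
x = [random.uniform(0, 10) for _ in range(500)]

slopes = []
for _ in range(400):
    # The mean is exactly linear in x; the noise SD grows with x.
    y = [1.0 + TRUE_SLOPE * xi + random.gauss(0, 0.5 + 0.5 * xi) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
print(f"average estimated slope: {mean_slope:.3f} (true value {TRUE_SLOPE})")
print(f"spread of the estimates: {sd_slope:.3f}")
```

The estimates scatter around the true slope even though the variance is far from constant, which is consistent with treating the heteroscedasticity as a separate issue from linearity.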


Fortran must die
Thanks. I knew about the issue of homoscedasticity, but I had never related it to the linearity check. The residuals of the overall model with all predictors are clearly homoscedastic (I know I will never spell these right; Fisher should be shot for using them).

One trick I had not known about is that SAS will overlay a loess curve, which makes it more obvious whether there is a pattern. For the plot above it clearly looks linear. The one below is more doubtful, but only where the curve moves toward a small group of outliers. I am not sure a predictor that is nonlinear only in a small group of outliers should be transformed. Because of questions about the validity of the data, we reset the most extreme point to the less extreme value suggested by the z score (our data contain errors due to various mistakes, and since we don't enter the data ourselves we can never be certain whether this has occurred).
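For anyone without SAS, the idea behind the loess overlay can be sketched in a few lines. This is not SAS's implementation, only a simplified local linear smoother with tricube weights run on simulated data. The function names, the `frac` span of 0.3, and the evaluation grid are assumptions chosen for illustration. The smoother is applied to the residuals of a straight-line fit. A curve that hugs zero suggests linearity, and a curve that bends away from zero suggests a pattern.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def local_linear_smooth(x, y, x0, frac=0.3):
    """Loess-style estimate at x0: a tricube-weighted straight-line fit
    through the nearest `frac` share of the points."""
    k = max(2, int(frac * len(x)))
    nearest = sorted(zip(x, y), key=lambda p: abs(p[0] - x0))[:k]
    h = abs(nearest[-1][0] - x0) or 1.0  # bandwidth = distance to k-th neighbour
    w = [(1 - (abs(xi - x0) / h) ** 3) ** 3 for xi, _ in nearest]
    sw = sum(w)
    if sw == 0:  # degenerate case: fall back to a plain average
        return statistics.fmean(yi for _, yi in nearest)
    mx = sum(wi * xi for wi, (xi, _) in zip(w, nearest)) / sw
    my = sum(wi * yi for wi, (_, yi) in zip(w, nearest)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, (xi, _) in zip(w, nearest))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, (xi, yi) in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def smoothed_residuals(x, y, grid):
    """Fit a straight line, then smooth its residuals over the grid."""
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return [local_linear_smooth(x, resid, g) for g in grid]

random.seed(2)
x = [random.uniform(0, 10) for _ in range(1000)]
y_linear = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
y_curved = [1 + 2 * xi + 0.3 * (xi - 5) ** 2 + random.gauss(0, 1) for xi in x]

grid = list(range(1, 10))  # interior points only, to avoid edge effects
smooth_linear = smoothed_residuals(x, y_linear, grid)
smooth_curved = smoothed_residuals(x, y_curved, grid)

print("linear data:", [round(v, 2) for v in smooth_linear])
print("curved data:", [round(v, 2) for v in smooth_curved])
```

One caveat that bears on the outlier question: a local smoother follows whatever few points sit at the extremes, so a bend that appears only where the data are sparse is weaker evidence of nonlinearity than a bend through the bulk of the points.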