residuals linearity

noetsi

No cake for spunky
#1
There are really two issues here. First, which of the many residuals that can be generated should I look at (this is from SAS)? Second, I am trying to detect non-linearity (I tested for multicollinearity and there is none, and with 5,000-plus points I don't care about heteroskedasticity or normality). But I can never tell with blobs of data like this whether the data are non-linear. None of the books I have seen show residuals like this when discussing the data.



1663873420137.png
 

Miner

TS Contributor
#2
I have seen this pattern when you have a boundary condition that limits the range of the residuals. It is easier to see when you are using regular residuals. Standardized and Studentized residuals obscure that. The graph below shows it better. Try plotting the residuals against each of the variables in your model. That will help you identify which variable is responsible. They really should cover this in regression textbooks because I have seen it regularly and had to figure it out on my own.

BTW, if you see a lot of diagonal lines in your residuals, it is probably caused by using an integer predictor.

1663880114679.png
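
A minimal sketch of the boundary effect described above, using simulated data rather than the data in this thread. The floor at zero, the coefficients, and all variable names are assumptions made for illustration. It shows why raw residuals make the boundary visible: when the response cannot go below 0, every raw residual is at least minus the fitted value, so the cloud has a sharp diagonal edge.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: one continuous predictor and a response floored at zero
x = rng.uniform(0, 10, n)
y_latent = -2.0 + 1.0 * x + rng.normal(0, 3, n)
y = np.maximum(y_latent, 0.0)  # boundary: the response cannot go below 0

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted  # raw (unstandardized) residuals

# Because y >= 0, every raw residual satisfies resid >= -fitted.
# Points sitting exactly on the floor lie on the line resid = -fitted.
at_floor = y == 0.0
print("share of points at the floor:", at_floor.mean())
print("smallest value of resid + fitted:", (resid + fitted).min())
```

Plotting `resid` against `fitted`, and then against each predictor in turn, should show which variable lines up with the edge.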
 

noetsi

No cake for spunky
#3
Only two of my predictor variables are not ordinal and one is effectively ordinal (the number of counselors is in theory unbounded and interval, but in practice is rarely greater than six). The dependent variable, income, has many levels, but income for this group is low so it might be effectively bounded.

Thanks, Miner. From what you say, this is probably not a case of non-linearity, although I am not sure. I don't really care about the regression assumptions other than linearity, because violating the others doesn't bias the estimates, because I have 5,000-plus data points, and because I actually have the population.

Can this type of issue distort the regression findings? Particularly the estimates?
 

Miner

TS Contributor
#4
It does not appear to show the indicators of nonlinearity. You want to focus on the most dense portion of the plot. Nonlinearity is usually pronounced. Of course you can always add a quadratic term and see what that does to your model (i.e., significance of term, AIC/BIC).
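
A rough sketch of the quadratic-term check, assuming simulated data with deliberate curvature (the coefficients and names here are hypothetical). AIC and BIC are computed from the residual sum of squares up to an additive constant, which is enough for comparing models fitted to the same data.

```python
import numpy as np

def ols_summary(X, y):
    """Fit OLS and return coefficients, t statistics, AIC and BIC.

    AIC/BIC use the Gaussian likelihood up to an additive constant,
    so only differences between models on the same data are meaningful.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return beta, beta / se, aic, bic

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0, 5, n)
# Assumed truth for illustration: mild curvature buried in noise
y = 1.0 + 2.0 * x + 0.3 * x**2 + rng.normal(0, 2, n)

X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])

_, _, aic_lin, bic_lin = ols_summary(X_lin, y)
beta_quad, t_quad, aic_quad, bic_quad = ols_summary(X_quad, y)

print("AIC linear vs quadratic:", aic_lin, aic_quad)
print("BIC linear vs quadratic:", bic_lin, bic_quad)
print("t statistic of the squared term:", t_quad[2])
```

A lower AIC/BIC for the quadratic model together with a large t statistic on the squared term would point to curvature, even when the residual plot looks like a blob.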

As for distorting the estimates, it does not appear to in my experience. I have encountered this when helping design engineers develop algorithms for embedded software. Those algorithms are thoroughly tested without any issues showing up. However, I cannot speak to any theory. As you said, this doesn't seem to be addressed in any of the books.
 

noetsi

No cake for spunky
#5
The problem I have with it being pronounced is that with larger data sets (this one has thousands of points) I can't see any pattern. :p

The literature I know says that, except for non-linearity, violations of the regression assumptions affect the standard errors, not the effect size. There are issues I have heard of that do influence effect size, such as omitted variable bias or attenuation of the coefficients when one of the variables has most of its values at one level, but these are not part of the classical assumptions one reads about.

When I was in graduate school, violation of the regression assumptions seemed a huge deal to me. But the more I read, the more it seems that if you have several thousand points (which I usually do) they really aren't that important. There are a lot of flaws in individual methods that still concern me (sometimes a lot, such as the issue of time invariance with fixed effects models, or nesting), but the classical assumptions no longer seem to be stressed as a major problem if you have enough data.

I think how generalizable your data are is probably the major issue (especially over time), but that does not come up much in the literature I read, and dealing with it is difficult. Since I commonly deal with populations, it is not as much of a problem for me as it is for researchers (which I am not, of course).
 

Miner

TS Contributor
#6
Do you have the ability to store your residuals and then generate a plot similar to this one, where the color represents the density of the plotted points? Some call this a binned scatterplot. Alternatively, you could use a lighter shade of gray for the plotted points so that the density shows up better.

1663962972442.png
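
One possible way to build the binned view from stored residuals, sketched with simulated data (the model and all names are assumptions, not the poster's). It computes a 2D grid of point counts for a density plot, plus the mean residual within equal-count bins of the fitted values, which tends to reveal a trend that a blob of thousands of points hides.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical stored results from a linear fit
x = rng.uniform(0, 10, n)
y = 3.0 + 1.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Density grid: number of points in each (fitted, residual) cell
counts2d, fit_edges, res_edges = np.histogram2d(fitted, resid, bins=(40, 40))

# Mean residual within equal-count bins of the fitted values
n_bins = 20
edges = np.quantile(fitted, np.linspace(0, 1, n_bins + 1))
idx = np.clip(np.searchsorted(edges, fitted, side="right") - 1, 0, n_bins - 1)
bin_n = np.bincount(idx, minlength=n_bins)
bin_mean = np.bincount(idx, weights=resid, minlength=n_bins) / bin_n

print("points per bin:", bin_n)
print("mean residual per bin:", np.round(bin_mean, 3))
```

If matplotlib is available, the grid can be drawn with `pcolormesh(fit_edges, res_edges, counts2d.T)`. Bin means that drift away from zero in a curve across the fitted values would suggest non-linearity.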