Comments on residuals

noetsi

Fortran must die
#1
This is for a major project and I want to be sure I get it right. I only care about the slope coefficients. I am testing the assumptions on a large population (about 4,500 cases). The dependent variable is logged income at closure.

I don't see any obvious heteroskedasticity, although there are lots of outliers.

[Attached image: residual plot]
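A minimal sketch of that check in Python/statsmodels (the data frame `df` and the variable names `log_income`, `x1`, `x2` are hypothetical stand-ins for my real data):

```python
# Residuals-vs-fitted check with simulated stand-in data; swap in the real
# data frame and variable names ('log_income', 'x1', 'x2' are hypothetical).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
n = 4500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["log_income"] = 1.0 + 0.5 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=n)

fit = smf.ols("log_income ~ x1 + x2", data=df).fit()

# Residuals vs. fitted values: a fan or funnel shape suggests heteroskedasticity.
plt.scatter(fit.fittedvalues, fit.resid, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Breusch-Pagan test as a formal complement to the eyeball check.
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pval:.4f}")
```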

Not fully normal, although I think it's close enough. We have some extreme incomes, which is why the outliers occur. They could be mistakes, but we have no way to know, since that is simply what was reported.

[Attached image: normality plot of the residuals]
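For the normality picture, a Q-Q plot of the residuals works; this reuses `fit` from the sketch above:

```python
# Q-Q plot of the residuals, reusing `fit` from the sketch above.
# With ~4,500 cases, some deviation in the tails is expected even under normality.
import statsmodels.api as sm
import matplotlib.pyplot as plt

sm.qqplot(fit.resid, line="s")  # "s" overlays a standardized reference line
plt.show()
```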

I use White SEs (with a whole population it's uncertain whether this matters).
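In statsmodels the White (heteroskedasticity-consistent) covariance can be requested at fit time; a sketch, reusing the stand-in `df` and `fit` from above:

```python
# Same model, but with White / heteroskedasticity-consistent standard errors.
# "HC0" is the original White estimator; "HC3" is a common finite-sample variant.
fit_white = smf.ols("log_income ~ x1 + x2", data=df).fit(cov_type="HC0")

# The slope estimates are unchanged; only the SEs (and hence p-values) differ.
print(fit.bse)        # conventional SEs
print(fit_white.bse)  # White SEs
```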

In terms of linearity, my understanding is that dummy variables do not have to meet this assumption. For the interval predictors, these are partial residual plots...

[Attached image: partial residual plot]
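Statsmodels can draw this kind of component-plus-residual plot directly; a sketch, reusing `fit` from the earlier block:

```python
# Partial residual (component-plus-residual) plot for one interval predictor,
# reusing `fit` from the earlier sketch. A roughly linear scatter supports
# linearity in that predictor.
import statsmodels.api as sm
import matplotlib.pyplot as plt

sm.graphics.plot_ccpr(fit, "x1")
plt.show()
```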

This is the one I have the most doubt about (although you could argue it is not interval, since it has only 67 distinct levels).

I don't see this as non-linear.


[Attached image: partial residual plot]

I have a lot of multicollinearity (values of .98 or higher), but I am reluctant to remove any of those variables from the model.
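One way to quantify the multicollinearity is the variance inflation factor for each predictor; a sketch, again reusing the stand-in `fit`:

```python
# Variance inflation factors from the fitted model's design matrix.
# Rules of thumb vary, but VIFs around 10 or more usually flag serious collinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = fit.model.exog
for i, name in enumerate(fit.model.exog_names):
    if name == "Intercept":
        continue
    print(name, variance_inflation_factor(exog, i))
```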
 

noetsi

Fortran must die
#3
Thanks a lot, Jake. One problem is that the literature tells me when violations occur, but not when the violations are serious enough to matter. With thousands of data points, I am not sure normality and heteroskedasticity make a difference, given the asymptotic nature of regression.

I use White's SEs just in case (though they rarely matter).
 

hlsmith

Not a robit
#4
Well, run the model with and without the top and bottom 2.5% of values (if you can systematically tease them out and know why they are there), then see how sensitive the model results are to trimming them. You are typically going to get some weirdness in the tails of the Q-Q plot; even if you simulate from a genuinely normal process you can get slight deviations in the tails. I would just test whether inclusion versus exclusion has any real impact on the estimates.
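Something like this sketch (Python/statsmodels; `df` and the column names `log_income`, `x1`, `x2` are hypothetical stand-ins for your data):

```python
# Sensitivity check: refit after trimming the top and bottom 2.5% of the
# dependent variable, then compare the slopes. `df` and the column names are
# hypothetical stand-ins for the real data.
import pandas as pd
import statsmodels.formula.api as smf

lo, hi = df["log_income"].quantile([0.025, 0.975])
trimmed = df[df["log_income"].between(lo, hi)]

fit_all = smf.ols("log_income ~ x1 + x2", data=df).fit()
fit_trim = smf.ols("log_income ~ x1 + x2", data=trimmed).fit()

# Side-by-side slopes: large shifts mean the tails are driving the results.
print(pd.concat({"all": fit_all.params, "trimmed": fit_trim.params}, axis=1))
```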

Think of time series analyses: an MA(2) model should only have autocorrelation spikes at lags 1 and 2, but by chance there can be slight additional spikes at other lags due to sampling variability. The same thing applies here. Alternatively, if the extreme cases aren't enough like the others and you know why they differ in your data, you can add an indicator variable to represent them.
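The indicator route is just an extra dummy in the model; a sketch, reusing the stand-in data and trim cutoffs from the previous block (the quantile-based flag is only a placeholder for whatever substantive reason identifies those cases):

```python
# Flag the extreme cases and let the model absorb them with a dummy instead of
# dropping them. The quantile-based flag is a stand-in; in practice the flag
# would come from whatever you know about why those cases differ.
df["extreme"] = (~df["log_income"].between(lo, hi)).astype(int)

fit_flag = smf.ols("log_income ~ x1 + x2 + extreme", data=df).fit()
print(fit_flag.params)
```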