# How to choose the best robust regression model?

#### consuli

##### Member
Hello.

There are several robust regression methods, like LAR (aka LAV, LAD, L1-norm) regression, quantile regression, M-estimators, ... They are assumed to be especially appropriate for data that do not fulfill the 5 OLS conditions.

The major part of the robust regression literature (that I read) argues abstractly, via the breakdown point, about which robust estimator should generally be preferred.

The other part of the robust regression literature (that I read) argues that the best robust estimator depends on the closest comparable theoretical distribution. E.g., the LAR estimator will most probably be the best robust estimator for approximately Laplace-distributed errors (although it has a worse breakdown point than the quantile estimator / M-estimator).

Question:
How do I choose the best robust regression model from multiple robust estimators for data that (graphically obviously) does not fulfill the OLS conditions?


#### Dason

Which conditions are being violated?

#### consuli

##### Member
Which conditions are being violated?
In robust regression problems - especially in mine - the constant variance assumption is heavily violated, in combination with skewed residuals.

#### rogojel

##### TS Contributor
hi,
maybe you could just use the generalized least squares with the appropriate variance structure? (package nlme in R)

regards
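A minimal sketch of what that could look like, under assumptions: the data frame `df` with columns `y` and `x` is simulated here as a placeholder, and `varPower` is just one of several nlme variance structures (`varExp`, `varIdent`, ... may fit other variance patterns better).

```r
# Sketch: GLS with an error standard deviation that grows as a power of x.
library(nlme)

set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = 0.5 * x)   # variance grows with x
df <- data.frame(x, y)

# weights = varPower(form = ~ x) models sd(eps) proportional to |x|^delta,
# with delta estimated from the data
gmod <- gls(y ~ x, data = df, weights = varPower(form = ~ x))
summary(gmod)
```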

#### consuli

##### Member
hi,
maybe you could just use the generalized least squares with the appropriate variance structure? (package nlme in R)

regards
I already have parameter estimates from LAR regression and quantile regression. Of course, further nlme parameter estimates may be interesting, too.

But my question is: which criterion shows me which robust regression model (or rather which set of estimates) is best? R^2 and correlation do not work on robust problems.

#### rogojel

##### TS Contributor
I already have parameter estimates from LAR regression and quantile regression. Of course, further nlme parameter estimates may be interesting, too.

But my question is: which criterion shows me which robust regression model (or rather which set of estimates) is best? R^2 and correlation do not work on robust problems.
Hi,
I would use cross validation.

regards
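A minimal sketch of what such a cross-validation could look like, under assumptions: the data frame `df` is simulated here as a stand-in for the real data, and the out-of-sample criterion is mean absolute error so that the comparison itself is not dominated by outliers.

```r
# Sketch: 5-fold cross-validation comparing OLS and median (LAD) regression
# by out-of-sample mean absolute error. Assumes quantreg is installed.
library(quantreg)

set.seed(1)
x <- runif(60, 0, 10)
y <- 1 + 2 * x + rt(60, df = 2)        # heavy-tailed errors
df <- data.frame(y = y, x = x)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))
err <- matrix(NA, k, 2, dimnames = list(NULL, c("ols", "lad")))

for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  m1 <- lm(y ~ x, data = train)
  m2 <- rq(y ~ x, data = train, tau = 0.5)
  err[i, "ols"] <- mean(abs(test$y - predict(m1, test)))
  err[i, "lad"] <- mean(abs(test$y - predict(m2, test)))
}
colMeans(err)   # lower is better
```

The same loop extends to any estimator that has a `predict` method, so GLS or M-estimators can be added as extra columns.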

#### GretaGarbo

##### Human
But my question is: which criterion shows me which robust regression model (or rather which set of estimates) is best?

But what is the problem here? Do you simply need to switch the distribution, e.g. to a gamma or log-normal (skewed and heteroscedastic)?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Good point Greta. OP, can we see what this data looks like or the residuals? Thanks.

#### consuli

##### Member
Concluding from your answers: there is no generally accepted goodness-of-fit measure for robust regression problems, right? Even if your answer is "no", this answers my question for the short term.

#### rogojel

##### TS Contributor
Concluding from your answers: there is no generally accepted goodness-of-fit measure for robust regression problems, right?
I think this is a fair statement even for "normal" OLS multiple regression.
I believe that using something like the RMSE measure with cross-validation is the least controversial way to pick a model.

regards

#### consuli

##### Member
To get this discussion a little bit more fact based, I have written a small R program that calculates OLS, GLM (with Gamma), LAR and quantile regression parameter estimates on two robust datasets from package robustbase. Further, it calculates R^2, Pearson correlation, BIC and MSE.

Code:
# column-wise MSE: y1 is the observed vector, y2 a matrix of fitted values
mse= function(y1, y2)  {
resid= y1 -y2
return(colSums(resid^2) /length(y1) )
}

library("robustbase")
library("robust")
library("quantreg")

str(get(data(pension)))
str(get(data(salinity)))

# Select robust dataframe
df= get(data(pension))[ , c(2, 1)]
# df= get(data(salinity))[ , c(2, 4)]

plot( df[ , 1]~ df[ , 2], data = df, cex= .5, col = "blue", xlab = "predictor", ylab = "target")

lmmod= lm(df[ , 1]~ df[ , 2], data= df)
glmgammamod= glm(df[ , 1]~ df[ , 2], data= df, family= Gamma(link = "identity") )
lmrobmod= lmRob(df[ , 1]~ df[ , 2], data= df)
rqmod= rq(df[ , 1]~ df[ , 2], data= df, tau= 0.5)

lm= lmmod$coefficients
glmgamma= glmgammamod$coefficients
lmrob= lmrobmod$coefficients
rq= rqmod$coefficients

# Calc Estimates
coefs= cbind(lm, glmgamma, lmrob, rq)
predictors= matrix( ncol=2, c(rep(1, nrow(df)), df[ , 2]) )
est= predictors %*% coefs

# Goodness of Fit

cor(df[ , 1], est, method= "p")
# Comparison with R^2
summary(lmmod)$r.squared

mse(df[ , 1], est)

BIC(lmmod)
BIC(glmgammamod)
# BIC(lmrobmod)  # BIC not available
# BIC(rqmod)     # BIC not plausible

# Bias Test
mean(df[ , 1])
colMeans(est)

# Coefficients
coefs
With the following results:
R^2 and Pearson correlation are indifferent.
BIC is only available for OLS and GLM.
MSE always prefers the OLS solution (which, however, is not plausible, as these are special datasets in favour of robust regression).

I have also tested on other robust datasets. Always the same implausible results.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
I am familiar that robust regression exists, but have not used it. I find it hard to believe that there aren't better resources for you. I will keep my eyes open in case I fortuitously stumble across something.

Could it be possible for you to simulate a dataset very close to yours, with known parameters and assumptions, that tests all the above approaches?

#### consuli

##### Member
I find it hard to believe that there aren't better resources for you.

Could it be possible for you to simulate a dataset very close to yours, with known parameters and assumptions, that tests all the above approaches?
I don't know how to do that. It would be helpful if (mathematical) guidance were provided on how to simulate the datasets, especially how to specify the increasing variance and skewness in the datasets, and then how to simulate them. If a clear mathematical concept is laid out, I am pretty confident I can program it in R.


#### consuli

##### Member
Thanks for the robust regression article from regression pope Fox.

As far as I could follow the article, it neither says anything about a robust goodness-of-fit measure nor about how to reproduce skewed residuals (which would be necessary to reproduce the robust regression datasets with known parameters, as you suggested).

However, it solved another problem I had. :-D

#### consuli

##### Member
After some thinking about the problem, I'd say that the Mean Absolute Error (MAE) would be the best model decision criterion for robust data problems that are miles away from normally distributed, or rather do not follow a theoretical distribution at all.
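A minimal MAE helper in the same layout as the mse() function from my program above (y1 a vector of observed values, y2 the matrix of fitted values from the different estimators):

```r
# Sketch: column-wise mean absolute error, analogous to the mse() above.
mae <- function(y1, y2) {
  resid <- y1 - y2
  colSums(abs(resid)) / length(y1)
}

# e.g. mae(df[, 1], est) would rank the estimators by MAE
```

One caveat: in-sample MAE is exactly the criterion that LAD/median regression optimizes, so it will tend to favour that fit; computing MAE on held-out folds, as suggested above with cross-validation, avoids that bias.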