How to choose the best robust regression model?

#1
Hello.

There are several robust regression methods like LAR- (aka LAV-, LAD-, L1-norm-) regression, quantile regression, M-estimators, ... They are assumed to be especially appropriate for data that do not fulfill the 5 OLS conditions.

The major part of the robust regression literature (that I read) argues abstractly, via the breakdown point, about which robust estimator should generally be preferred.

The other part of the robust regression literature (that I read) argues that the best robust estimator depends on the closest comparable theoretical distribution. E.g. the LAR estimator will most probably be the best robust estimator for approximately Laplace-distributed errors (although it has a worse breakdown point than the quantile estimator / M-estimator).

Question:
How do I choose the best robust regression model from multiple robust estimators for data that (graphically obviously) does not fulfill the OLS conditions?
 
Last edited:

rogojel

TS Contributor
#4
hi,
maybe you could just use the generalized least squares with the appropriate variance structure? (package nlme in R)

regards
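A minimal sketch of that suggestion, on simulated data (the data frame `df` and the variance structure are assumptions for illustration, not the original poster's data): `nlme::gls` with `varPower` lets the residual variance grow with the predictor.

```r
library(nlme)

# hypothetical example data with variance increasing in x
set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 2 + 0.5 * df$x + rnorm(100, sd = 0.1 * df$x)

# varPower models Var(eps_i) = sigma^2 * |x_i|^(2*delta),
# with delta estimated from the data
fit <- gls(y ~ x, data = df, weights = varPower(form = ~ x))
summary(fit)
```

Other variance functions (`varIdent`, `varExp`, ...) can be swapped in depending on how the spread behaves.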
 
#5
hi,
maybe you could just use the generalized least squares with the appropriate variance structure? (package nlme in R)

regards
I already have parameter estimates from LAR regression and quantile regression. Of course, additional nlme parameter estimates may be interesting, too.

But my question is: what criterion shows me which robust regression model (respectively its estimates) is best? R^2 and correlation do not work for robust problems.
 

rogojel

TS Contributor
#6
I already have parameter estimates from LAR regression and quantile regression. Of course, additional nlme parameter estimates may be interesting, too.

But my question is: what criterion shows me which robust regression model (respectively its estimates) is best? R^2 and correlation do not work for robust problems.
Hi,
I would use cross validation.

regards
 
#7
But my question is, what criteria shows me which robust regression model respectively its estimates is best.
When someone asks about "the best", one starts to wonder: best by what optimality criterion?

But what is the problem here? Do you simply need to switch distributions, e.g. to a gamma or log-normal distribution (skewed and heteroscedastic)?
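A sketch of that idea on simulated data (the multiplicative-error setup is an assumption for illustration): a Gamma GLM with log link and an OLS fit on the log scale are two standard ways to handle a skewed response whose spread grows with the mean.

```r
# simulated skewed, heteroscedastic data: Var(y) proportional to mu^2
set.seed(42)
x <- runif(200, 1, 10)
mu <- exp(0.3 + 0.2 * x)                          # mean grows with x
y <- rgamma(200, shape = 4, rate = 4 / mu)        # mean mu, constant CV

# Gamma GLM with log link
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"))

# log-normal alternative: OLS on the log scale
fit_lnorm <- lm(log(y) ~ x)

summary(fit_gamma)$coefficients
summary(fit_lnorm)$coefficients
```

Both recover the slope on the log scale; which fits better in a given case can be checked with residual diagnostics.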
 
#9
Concluding from your answers: there is no generally accepted goodness-of-fit measure for robust regression problems, right? Even if the answer is "no", that will answer my question for the short term.
 

rogojel

TS Contributor
#10
Concluding from your answers. There is no generally accepted goodness of fit measure for robust regression problems, right?
I think this is a fair statement even for "normal" OLS multiple regression.
I believe that using something like the RMSE measure with cross-validation is the least controversial way to pick a model.

regards
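A minimal sketch of that, on one of the robustbase datasets mentioned in this thread (the choice of response/predictor columns here is an assumption for illustration): k-fold cross-validated RMSE comparing OLS against median (L1) regression.

```r
library(robustbase)
library(quantreg)

data(salinity)
df <- salinity[, c("Y", "X3")]     # response and one predictor (illustrative)
names(df) <- c("y", "x")

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))

# mean out-of-fold RMSE for any model-fitting function
rmse_cv <- function(fit_fun) {
  errs <- sapply(1:k, function(i) {
    train <- df[folds != i, ]
    test  <- df[folds == i, ]
    fit <- fit_fun(train)
    sqrt(mean((test$y - predict(fit, newdata = test))^2))
  })
  mean(errs)
}

rmse_cv(function(d) lm(y ~ x, data = d))             # OLS
rmse_cv(function(d) rq(y ~ x, tau = 0.5, data = d))  # median regression
```

The same harness works for any estimator that supports `predict`, so all the candidate models can be scored on equal footing.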
 
#11
To make this discussion a little more fact-based, I have written a small R program that calculates OLS, GLM (with Gamma), LAR and quantile regression parameter estimates on two robust datasets from the package robustbase. Further, it calculates R^2, Pearson correlation, BIC and MSE.

Code:
# MSE per model: y1 = observed vector, y2 = matrix of fitted values
# (one column per model)
mse <- function(y1, y2) {
  resid <- y1 - y2
  colSums(resid^2) / length(y1)
}


library("robustbase")
library("robust")
library("quantreg")


str(get(data(pension)))
str(get(data(salinity)))

# Select robust data frame: first column = target, second = predictor
df <- get(data(pension))[, c(2, 1)]
# df <- get(data(salinity))[, c(2, 4)]

plot(df[, 1] ~ df[, 2], data = df, cex = .5, col = "blue",
     xlab = "predictor", ylab = "target")


# Fit the four competing models
lmmod       <- lm(df[, 1] ~ df[, 2], data = df)                # OLS
glmgammamod <- glm(df[, 1] ~ df[, 2], data = df,
                   family = Gamma(link = "identity"))          # Gamma GLM
lmrobmod    <- lmRob(df[, 1] ~ df[, 2], data = df)             # robust M-estimator
rqmod       <- rq(df[, 1] ~ df[, 2], data = df, tau = 0.5)     # median (L1) regression


lm       <- lmmod$coefficients
glmgamma <- glmgammamod$coefficients
lmrob    <- lmrobmod$coefficients
rq       <- rqmod$coefficients


# Calc estimates: design matrix (intercept + predictor) %*% coefficients
# gives the fitted values of all models at once, one column per model
coefs <- cbind(lm, glmgamma, lmrob, rq)
predictors <- matrix(ncol = 2, c(rep(1, nrow(df)), df[, 2]))
est <- predictors %*% coefs


# Goodness of fit

cor(df[, 1], est, method = "p")   # Pearson correlation per model
# Comparison with R^2
summary(lmmod)$r.squared

mse(df[, 1], est)

BIC(lmmod)
BIC(glmgammamod)
# BIC(lmrobmod)  # BIC not available
# BIC(rqmod)     # BIC not plausible

# Bias test: mean of fitted values vs. mean of observed target
mean(df[, 1])
colMeans(est)

# Coefficients
coefs
With the following results:
R^2 and Pearson correlation are indifferent.
BIC is only available for OLS and GLM.
MSE always prefers the OLS solution (which is not plausible, as these are special datasets in favour of robust regression).

I have also tested on other robust datasets. Always the same implausible results.
 

hlsmith

Omega Contributor
#12
I am aware that robust regression exists, but have not used it. I find it hard to believe that there aren't better resources for you. I will keep my eyes open in case I fortuitously stumble across something.


Could it be possible for you to simulate a dataset very close to yours, with known parameters and assumptions, to test all the above approaches?
 
#13
I find it hard to believe that there aren't better resources for you.
Any helpful links are highly appreciated.

Could it be possible for you to simulate a dataset very close to yours, with known parameters and assumptions, to test all the above approaches?
I don't know how to do that. It would be helpful if (mathematical) guidance were provided on how to simulate such datasets, especially how to specify the increasing variance and skewness. If a clear mathematical concept is laid out, I am pretty confident I can program it in R.
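One possible recipe for such a simulation (the linear mean and the gamma-distributed errors are assumptions for illustration): draw right-skewed errors and scale them with the predictor, so the data show both skewness and increasing variance while the true coefficients stay known.

```r
set.seed(123)
n <- 200
x <- runif(n, 0, 10)
a <- 1; b <- 2                       # known "true" parameters

# right-skewed errors: gamma(shape = 2) has skewness 2/sqrt(2) ~ 1.41;
# subtracting the mean centers them, the factor makes the sd grow with x
eps <- (rgamma(n, shape = 2, rate = 1) - 2) * (0.2 + 0.1 * x)

y <- a + b * x + eps
plot(x, y)

# with known a and b, each estimator can now be scored by how close
# its coefficient estimates come to the truth
coef(lm(y ~ x))
```

Varying the gamma shape and the scaling factor controls how skewed and how heteroscedastic the simulated data become.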
 
Last edited:
#15
Thanks for the robust regression article by regression pope Fox.

As far as I could follow the article, it says nothing about a robust goodness-of-fit measure, nor about how to reproduce skewed residuals (which would be necessary to reproduce the robust regression datasets with known parameters, as you suggested).

However, it solved another problem I had. :-D
 
#16
After some thinking about the problem, I'd say that the Mean Absolute Error (MAE) would be the best model selection criterion for robust data problems that are far from normally distributed, or that do not follow a theoretical distribution at all.
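A minimal sketch of MAE as the selection criterion, on simulated skewed data (the data and model pair are illustrative assumptions):

```r
library(quantreg)

set.seed(7)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rexp(100, rate = 0.5)   # right-skewed errors

mae <- function(y, yhat) mean(abs(y - yhat))

mae_lm <- mae(y, fitted(lm(y ~ x)))              # OLS
mae_rq <- mae(y, fitted(rq(y ~ x, tau = 0.5)))   # median (L1) regression
mae_lm
mae_rq
```

One caveat: in-sample MAE will by construction favour the L1 fit, since median regression minimizes the sum of absolute residuals over all linear fits. So MAE is fairest when combined with cross-validation, scoring each model on held-out data.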