# finding the best fit and removing outliers with MARS regression

#### paul15071992

##### New Member
I am using the regression method called MARS (in R it is the earth() function from the earth package) to find the best regression model for my data.

As far as I know, this method is suitable for large data sets, can handle NA values, and decides by itself which variables enter the regression and which do not.

What I'm doing

After the model is fitted, I detect outliers with a boxplot of the absolute residuals, remove the observations with extreme values from the data, and fit the model again.

I repeat this until grsq and rsq stop improving, i.e. until their maximum is reached.
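To make the boxplot rule concrete, here is a minimal base R sketch on simulated absolute residuals (the data and the two planted extreme values are assumptions for illustration, not my real data). It shows that the `stats` component holds five numbers, so the cutoff for "extreme" is the upper whisker `stats[5]`, and that the flagged values are the same ones returned in `out`.

```r
# Simulated absolute residuals (an assumption; not the real car data)
set.seed(1)
residuals_abs <- abs(c(rnorm(200), 8, 10))  # two planted extreme values

# boxplot.stats() applies the same rule as boxplot(), without plotting
bp <- boxplot.stats(residuals_abs)

# bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
upper_whisker <- bp$stats[5]

# Indices of observations beyond the upper whisker
outlier_idx <- which(residuals_abs > upper_whisker)
print(outlier_idx)
```

Because absolute residuals are bounded below by zero, essentially only the upper whisker matters here.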

CODE

Code:
    library(earth)

    # Assumes `data`, `distances` and `weights` already exist
    model <- earth(log(price) ~ ., data = data, weights = weights)
    max_grsq <- round(model$grsq, digits = 4)
    max_rsq <- round(model$rsq, digits = 4)
    min_diff <- abs(max_grsq - max_rsq)

    done <- FALSE
    while (!done) {
      # Flag observations whose absolute residual lies beyond the boxplot whiskers
      # (stats[1] is the lower whisker, stats[5] the upper whisker)
      residuals_abs <- as.vector(abs(model$residuals))
      bp <- boxplot(residuals_abs, plot = FALSE)
      indexes_to_remove <- which(residuals_abs > bp$stats[5] | residuals_abs < bp$stats[1])

      if (length(indexes_to_remove) > 0) {
        data <- data[-indexes_to_remove, ]
        distances <- distances[-indexes_to_remove]
        weights <- (1 / distances) / sum(1 / distances)
      }

      # Refit on the trimmed data
      tempModel <- earth(log(price) ~ ., data = data, weights = weights)
      temp_grsq <- round(tempModel$grsq, digits = 4)
      temp_rsq <- round(tempModel$rsq, digits = 4)
      temp_diff <- abs(temp_grsq - temp_rsq)

      # Accept the refit only if one measure improved and the other did not get worse
      if ((temp_grsq > max_grsq && temp_rsq >= max_rsq) || (temp_grsq >= max_grsq && temp_rsq > max_rsq)) {
        model <- tempModel
        max_grsq <- temp_grsq
        max_rsq <- temp_rsq
        min_diff <- temp_diff
      } else {
        done <- TRUE
      }
    }

QUESTION

I'm not a statistician so I don't know any better way for removing the outliers.

- is my approach correct?
- should I use another approach?
- I know that there are bad outliers and good outliers (leverage points), how can I remove only the bad outliers?
- I'm using the semi-log form of the regression; because of the dummy variables I can't use the log-log form. Is there another approach to transforming the data? Or should I standardize it, e.g. x <- (x - x_min)/(x_max - x_min)?

Does anyone have any hints?
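For reference, the min-max standardization mentioned in the last question could be sketched in base R as follows. The helper name `min_max`, the data frame `car_data` and its columns are hypothetical and only serve as an illustration; the dummy column is deliberately left untouched.

```r
# Min-max scaling: maps a numeric vector onto the interval [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical toy data: two numeric predictors and one dummy variable
car_data <- data.frame(
  mileage = c(12000, 85000, 40000),
  age     = c(1, 9, 4),
  diesel  = c(0, 1, 0)
)

# Scale only the numeric predictors, not the dummy
num_cols <- c("mileage", "age")
car_data[num_cols] <- lapply(car_data[num_cols], min_max)
print(car_data)
```

This is a linear rescaling of each predictor, so it changes the units of the inputs but not their ordering.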

#### hlsmith

##### Less is more. Stay pure. Stay poor.
What is the purpose of these analytics? It seems like you may be taking a blind approach (e.g., having a program decide on the model, then eliminating the tails of the distribution to maximize the R^2).

Side note: with roughly normal data, about 5% of your observations are always going to be > 2 SD away from the mean! So chopping at the ends ad nauseam is a recursive process. What are the limitations to you building this model yourself based on content? Yes, most models have a leverage value for observations.
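A small simulation illustrates the point. This is only a sketch on simulated standard normal data (an assumption; no real outliers are present), using a plain 2 SD rule rather than the boxplot rule from the question. Each round recomputes the mean and SD on the already trimmed data and trims again.

```r
set.seed(42)
n_start <- 10000
x <- rnorm(n_start)          # clean normal data, no true outliers

n_removed <- integer(5)      # points flagged in each of five rounds
for (i in seq_along(n_removed)) {
  flagged <- abs(x - mean(x)) > 2 * sd(x)  # "more than 2 SD from the mean"
  n_removed[i] <- sum(flagged)
  x <- x[!flagged]                          # trim and start over
}

first_round_share <- n_removed[1] / n_start
print(n_removed)
print(first_round_share)
```

Every round still flags some points, because trimming shrinks the SD and so moves the cutoff inward.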

#### paul15071992

##### New Member
The purpose of the model is prediction.
I want to use the regression to estimate the price of a car based on its characteristics.

I stop trimming the data at the moment when the maximum grsq is achieved.
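To see how trimming interacts with this stopping rule, here is a rough sketch with simulated data. It uses lm() as a simple stand-in for earth() and plain R^2 in place of grsq (both are assumptions made to keep the example in base R), and the data contain no real outliers by construction.

```r
set.seed(7)
n <- 2000
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)    # correctly specified model, clean noise

# Fit on all observations
fit_full <- lm(y ~ x)
r2_full <- summary(fit_full)$r.squared

# Drop observations whose absolute residual is beyond the upper whisker
res_abs <- abs(residuals(fit_full))
keep <- res_abs <= boxplot.stats(res_abs)$stats[5]

# Refit on the trimmed data
fit_trim <- lm(y ~ x, subset = keep)
r2_trim <- summary(fit_trim)$r.squared

print(c(full = r2_full, trimmed = r2_trim, removed = sum(!keep)))
```

In this setup the in-sample R^2 goes up after trimming even though nothing was wrong with the removed points, so a higher value after trimming does not by itself show that the predictions got better.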