Splitting obs. based on DV class for separate random forest runs

#1
I am working with a relatively large (n=162,000) data set on a land use classification problem using random forest. I have four land use classes (young forest, older forest, agriculture, and other) that I am trying to predict from land cover and other IVs. While important, the young forest class is very small compared to the other classes. One of the IV land cover classes used to predict the DV "young forest" is "grass/shrub". Grass/shrub is also very small (n=~12,000) compared to the other land cover observations. Running RF on the full data set results in OK Cohen's kappa scores (~.80) overall, but the producer's accuracy and user's accuracy (also called "sensitivity" and "positive predictive value" in the R package "caret") are very low (~.20) for the young forest land use predictions.
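To make the metrics concrete, here is a minimal sketch of how Cohen's kappa and the per-class producer's and user's accuracies fall out of a confusion matrix. The function name `accuracy_report` and the counts are made up for illustration (they are not my data); they were only chosen so that overall kappa is near .80 while both accuracies for young forest are .20, the same pattern described above.

```python
def accuracy_report(confusion, labels):
    """Return (kappa, producer, user) from a square confusion matrix.

    confusion[i][j] is the number of observations whose reference class is
    labels[i] and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(confusion[i]) for i in range(n)]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = sum(confusion[i][i] for i in range(n)) / total
    expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / total**2
    kappa = (observed - expected) / (1 - expected)

    # Producer's accuracy (sensitivity): correct / reference total for the class.
    producer = {labels[i]: confusion[i][i] / row_totals[i] for i in range(n)}
    # User's accuracy (positive predictive value): correct / predicted total.
    user = {labels[i]: confusion[i][i] / col_totals[i] for i in range(n)}
    return kappa, producer, user


# Hypothetical counts, rows = reference class, columns = predicted class.
labels = ["young forest", "older forest", "agriculture", "other"]
confusion = [
    [20, 60, 10, 10],
    [50, 4400, 300, 250],
    [20, 200, 2700, 80],
    [10, 140, 90, 1660],
]

kappa, producer, user = accuracy_report(confusion, labels)
print(f"kappa = {kappa:.3f}")
for name in labels:
    print(f"{name}: producer = {producer[name]:.2f}, user = {user[name]:.2f}")
```

The point of the toy numbers is that a rare class contributes very little to overall agreement, so kappa can look fine even when that class is predicted poorly.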

It has been suggested that I split the observations so that the IV land cover class "grass/shrub" gets a separate run from the rest of the observations, as an extension of both stratification and ensemble principles. I have done this, and it improved producer's and user's accuracies for both older and young forest (not tremendously, but enough that I will use this as the model form).
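In outline, the split-and-recombine procedure looks like the sketch below. This is only an illustration under stated assumptions, not the code I ran: `fit_by_stratum`, `predict_by_stratum`, `stratum_of`, and the `land_cover` field are hypothetical names, and `MajorityClassModel` is a trivial stand-in so the example runs without extra packages. In practice the stand-in would be replaced by a real random forest with the same `fit`/`predict` shape (for example scikit-learn's `RandomForestClassifier`, which would need numeric feature arrays rather than dicts).

```python
from collections import Counter, defaultdict


class MajorityClassModel:
    """Stand-in classifier: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def stratum_of(row):
    # Assumption: each observation records its land cover class.
    return "grass/shrub" if row["land_cover"] == "grass/shrub" else "rest"


def fit_by_stratum(rows, targets, make_model):
    """Fit one model per stratum and return them keyed by stratum name."""
    groups = defaultdict(lambda: ([], []))
    for row, target in zip(rows, targets):
        X, y = groups[stratum_of(row)]
        X.append(row)
        y.append(target)
    return {key: make_model().fit(X, y) for key, (X, y) in groups.items()}


def predict_by_stratum(models, rows):
    """Route each observation to the model fitted on its own stratum."""
    return [models[stratum_of(row)].predict([row])[0] for row in rows]


# Tiny made-up data set, for illustration only.
rows = [
    {"land_cover": "grass/shrub"},
    {"land_cover": "grass/shrub"},
    {"land_cover": "grass/shrub"},
    {"land_cover": "forest"},
    {"land_cover": "forest"},
    {"land_cover": "cropland"},
]
targets = [
    "young forest", "young forest", "other",
    "older forest", "older forest", "agriculture",
]

models = fit_by_stratum(rows, targets, MajorityClassModel)
predictions = predict_by_stratum(models, rows)
print(predictions)
```

The combined predictions can then go into a single confusion matrix, so accuracies are comparable with the unsplit run. The sketch predicts on its own training rows only to stay short; a real comparison would use held-out data (or out-of-bag estimates) within each stratum.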

My question is, as I am not an expert on this subject, is there any literature out there that used a similar process? I would like to see how others approached this and reported their results. I am not even sure what to call the search for such a procedure. I have tried "splitting observations for separate random forest runs" but that only leads into discussions about node splitting, etc. Any leads would be very welcome - many thanks in advance!!! jaj
 

hlsmith

Omega Contributor
#2
This process is not exclusive to random forests. Many times, analysts will find themselves breaking off subsets of a dataset for various reasons, including accuracy.


I am not well versed in random forests; you probably know them better than I do. Still, the approach seems acceptable. I work with medical data, and at times some subgroups of patients have different predictors, or the predictors function differently in certain patients in the sample. Splitting the sample for these patients can make sense, since they are different (e.g., they have a co-morbidity, age affects the outcome differently, etc.). The thing you have to remember is how you are going to interpret and use these results in the future. Writing up your current results isn't hard: just describe the process and call it post hoc subsampling to improve model accuracy. You can also present the results both ways (subsetted and not subsetted). Just remember that splitting may be a good idea given your sample, but it could hinder interpretability and make the results harder to generalize to other samples that don't have the nuanced differences you saw, in which case you are overfitting.


I don't know if there is a good name or term for it. I would say subsampling or subsetting, or dichotomizing/splitting if you were doing this to a continuous variable.