Hi,
In order to predict a binary target variable, I trained a random forest with 84 explanatory variables (with 10 variables randomly selected at each split) on a training set of 8,500 observations.
For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.
Performance on the test set at a decision threshold of 0.6:
Recall: 50%
Precision: 70%
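To make the setup concrete, here is a minimal sketch of what I did, written scikit-learn-style on synthetic stand-in data (the library, the tree count, and the data themselves are placeholders; only the dimensions, max_features=10, and the 0.6 threshold reflect the actual run):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 84 explanatory variables, binary target,
# 8,500 training observations and 100,000 test observations
X, y = make_classification(n_samples=108_500, n_features=84,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=8_500, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,   # placeholder; my actual tree count may differ
    max_features=10,    # 10 variables considered at each split
    random_state=0,
).fit(X_train, y_train)

# Score the test set and apply the 0.6 decision threshold
proba = rf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.6).astype(int)

print("Recall:   ", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
```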
After that, I used the variable importance plot to select the 20 most important variables and retrained the model on those variables alone.
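Roughly, that selection step looked like this (continuing from the sketch above; I use the forest's impurity-based feature_importances_ here as a stand-in for reading the variable importance plot):

```python
import numpy as np

# Rank features by the forest's impurity-based importances
# and keep the indices of the top 20
top20 = np.argsort(rf.feature_importances_)[::-1][:20]

# Retrain on the 20 most important variables only,
# keeping max_features=10 as in the original model
rf_top20 = RandomForestClassifier(
    n_estimators=500,
    max_features=10,
    random_state=0,
).fit(X_train[:, top20], y_train)

# Same 0.6 threshold, same test set, reduced feature set
proba_top20 = rf_top20.predict_proba(X_test[:, top20])[:, 1]
y_pred_top20 = (proba_top20 >= 0.6).astype(int)

print("Recall:   ", recall_score(y_test, y_pred_top20))
print("Precision:", precision_score(y_test, y_pred_top20))
```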
Performance on the test set dropped dramatically:
Recall: 20%
Precision: 6%
Does anyone know a scientific explanation for this counterintuitive phenomenon?
Thank you for your help,
Marco