Variable importance in random forest

#1
Hi,

In order to predict a binary target variable, I trained a random forest with 84 explanatory variables (10 variables randomly selected at each split) on a training set of 8,500 observations.

For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.

Performance on the test set at a decision threshold of 0.6:

Recall: 50%
Precision: 70%

After that, I used the variable importance plot to select the most important variables, and retrained the model with the top 20.
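As a sketch of this workflow (assuming scikit-learn, which the thread never names, and synthetic data in place of the real 84 variables):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real data: 8,500 rows, 84 explanatory variables
X, y = make_classification(n_samples=8500, n_features=84,
                           n_informative=10, random_state=0)

# max_features=10 mirrors "10 variables randomly selected at each split"
rf = RandomForestClassifier(n_estimators=100, max_features=10,
                            random_state=0).fit(X, y)

# Rank variables by impurity-based importance and keep the top 20
top20 = np.argsort(rf.feature_importances_)[::-1][:20]

# Refit on the reduced variable set, as described in the post
rf_top20 = RandomForestClassifier(n_estimators=100, max_features=10,
                                  random_state=0).fit(X[:, top20], y)
```

Note that `feature_importances_` here is the impurity-based (Gini) importance; other implementations may use permutation importance, which can rank variables differently.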

Performance on the test set decreased dramatically:

Recall: 20%
Precision: 6%

Does anyone know a scientific explanation for this counterintuitive phenomenon?

Thank you for your help,
Marco
 

hlsmith

Not a robit
#2
Is the training set a random sample of the training plus testing set?


Decision threshold = 0.60: does that mean one group in the terminal nodes had to contain 60% of the branch's observations?
 
#3
The full data set contains 500,000 observations. It is a rare class problem with the target event representing 0.05% of the full data set.

First, I use stratified random sampling to build the test set, preserving the target event/non-event proportions.

To build the training set, I set the test set aside and use simple random sampling on the remainder.

All training sets contain the same event observations, since I keep every event observation except those in the test set; the simple random sampling is applied only to the non-event observations. The target event proportion in every training set is 5%.

To sum up, the training set is NOT a simple random sample of the full data set; the test set is set aside before sampling the training set.
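The sampling scheme above can be sketched like this (a minimal sketch with toy NumPy data; the real data and the library used are not shown in the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the full data: 500,000 rows, rare positive class (~0.05%)
n = 500_000
y = (rng.random(n) < 0.0005).astype(int)
idx = np.arange(n)

# 1) Stratified test set of 100,000 preserving the event proportion
test_idx = np.concatenate([
    rng.choice(idx[y == c],
               size=int(round(100_000 * (y == c).mean())),
               replace=False)
    for c in (0, 1)
])
rest = np.setdiff1d(idx, test_idx)

# 2) Training set: keep every remaining event, and undersample
#    non-events so that events make up exactly 5% of the training set
#    (19 non-events per event)
ev = rest[y[rest] == 1]
nonev = rng.choice(rest[y[rest] == 0], size=19 * len(ev), replace=False)
train_idx = np.concatenate([ev, nonev])
```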


Finally, suppose Y = 1 is the target event and Y = 0 is the target non event.

Then the decision threshold is simply: for a fixed t in (0, 1), Ŷ = 1 if and only if P(Y = 1 | X) > t.
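In code, assuming a scikit-learn-style classifier (the thread does not name the library), that rule is just a threshold on the predicted probability:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

t = 0.6                            # the arbitrarily fixed threshold
p1 = rf.predict_proba(X)[:, 1]     # estimated P(Y = 1 | X)
y_hat = (p1 > t).astype(int)       # Y-hat = 1 iff P(Y = 1 | X) > t
```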

Is it all clear?

Thank you for your help,
Marco
 

hlsmith

Not a robit
#4
Well, if you are now using a subset of variables, the precision etc. will be different in the testing phase. What did the variable importances look like in the training phase?
 
#5
Yes, sure. But the problem is: how could the decrease in performance be so drastic? If you select the most important variables, how is it possible for performance to drop like this?

It is something I really cannot explain. A small decrease, okay, but a difference as big as the one above... I cannot imagine why.

Do you have an idea? Maybe it is due to this particular algorithm?

Thank you,
Marco
 

rogojel

TS Contributor
#6
hi,
my blind guess would be that your variables are not really capturing the effect; it looks like no variable stands out in terms of importance. I suspect that if you trained on a different sample, you would get a different set of variables in the top 20.

regards
 

hlsmith

Not a robit
#7
Did the outcome group or the outcome variable get switched?

I agree that is a large discrepancy.

I wouldn't use the resulting models, but to rule out a coding issue, perhaps experiment with slowly increasing the training sample size.