
Thread: Variable importance in random forest

  1. #1

    Variable importance in random forest




    Hi,

    To predict a binary target variable, I trained a random forest with 84 explanatory variables (10 variables randomly selected as candidates at each split) on a training set of 8,500 observations.

    For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.

    Performance on the test set at a decision threshold of 0.6:

    Recall: 50%
    Precision: 70%

    After that, I used the variable importance plot to select the most important variables, and retrained the model with only the top 20.

    Performance on the test set dropped dramatically:

    Recall: 20%
    Precision: 6%
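
    For reference, the workflow described above (fit on all 84 variables, keep the 20 with the highest importance, refit, compare precision and recall at a 0.6 threshold) could be sketched as follows. This is only an illustration under assumptions: the thread does not say which software or importance measure was used, so scikit-learn, its impurity-based `feature_importances_`, the synthetic data from `make_classification`, and the helper name `fit_and_score` are all stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 84 features, 5% positives.
X, y = make_classification(
    n_samples=4000, n_features=84, n_informative=10,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)


def fit_and_score(cols, threshold=0.6):
    """Fit a forest on the given columns.

    Returns (precision, recall, fitted_forest), with precision and recall
    measured on the held-out test set at the given probability threshold.
    """
    forest = RandomForestClassifier(
        n_estimators=100, max_features=min(10, len(cols)), random_state=0
    )
    forest.fit(X_train[:, cols], y_train)
    # Column 1 of predict_proba corresponds to forest.classes_[1], here label 1.
    proba = forest.predict_proba(X_test[:, cols])[:, 1]
    pred = (proba > threshold).astype(int)
    precision = precision_score(y_test, pred, zero_division=0)
    recall = recall_score(y_test, pred, zero_division=0)
    return precision, recall, forest


# Step 1: fit on all variables.
all_cols = np.arange(X.shape[1])
prec_full, rec_full, full_forest = fit_and_score(all_cols)

# Step 2: keep the 20 variables with the largest importance scores.
top20 = np.argsort(full_forest.feature_importances_)[::-1][:20]

# Step 3: refit on the reduced set and compare on the same test set.
prec_top, rec_top, _ = fit_and_score(top20)

print(f"all 84 variables: precision={prec_full:.2f} recall={rec_full:.2f}")
print(f"top 20 variables: precision={prec_top:.2f} recall={rec_top:.2f}")
```

    Both fits are scored on the same test rows with the same threshold, so any difference comes from the variable subset alone. That is the comparison to reproduce when checking for a coding issue.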

    Does anyone know a scientific explanation for this counterintuitive phenomenon?

    Thank you for your help,
    Marco

  2. #2
    hlsmith

    Re: Variable importance in random forest

    Is the training set a random sample of the training plus testing set?


    Does a decision threshold of 0.60 mean that one group had to account for 60% of the observations in a terminal node?
    Stop cowardice, ban guns!

  3. #3

    Re: Variable importance in random forest

    The full data set contains 500,000 observations. It is a rare class problem with the target event representing 0.05% of the full data set.

    First, I draw the test set by stratified random sampling in order to preserve the event/non-event proportions.

    To build the training set, I set the test set aside and use simple random sampling on the remaining observations.

    All training sets contain the same event observations, since I keep every event observation except the ones in the test set. So the simple random sampling is applied only to non-event observations. The target event proportion in all the training sets is 5%.

    To sum up, the training set is NOT a simple random sample of the full data set. The test set is set aside before sampling for the training set.
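
    The sampling scheme summarised above could be sketched like this. It is a minimal illustration, not the code actually used: the function name `split_rare_event`, its parameters, and the toy label vector are assumptions made for the example.

```python
import random


def split_rare_event(labels, test_fraction, train_event_rate, seed=0):
    """Return (train_idx, test_idx) for a rare-event data set.

    The test set is stratified (same fraction taken from each class).
    The training set keeps every remaining event and adds a simple random
    sample of the remaining non-events, sized so that events make up
    `train_event_rate` of the training set.
    """
    rng = random.Random(seed)
    events = [i for i, y in enumerate(labels) if y == 1]
    non_events = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(events)
    rng.shuffle(non_events)

    # Stratified test set: the same fraction of events and of non-events.
    n_test_ev = round(len(events) * test_fraction)
    n_test_non = round(len(non_events) * test_fraction)
    test_idx = events[:n_test_ev] + non_events[:n_test_non]

    # Training set: all events not in the test set ...
    train_events = events[n_test_ev:]
    # ... plus a random sample of the non-events not in the test set.
    remaining_non = non_events[n_test_non:]
    n_train_non = round(len(train_events) * (1 - train_event_rate) / train_event_rate)
    n_train_non = min(n_train_non, len(remaining_non))
    train_idx = train_events + remaining_non[:n_train_non]
    return train_idx, test_idx


# Toy data (assumption): 10,000 observations with 0.5% events.
labels = [1] * 50 + [0] * 9950
train_idx, test_idx = split_rare_event(labels, test_fraction=0.2, train_event_rate=0.05)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), train_rate)
print(len(test_idx), test_rate)
```

    Note that the event rate in the training set (5% here) differs from the event rate in the test set, which keeps the proportion of the full data.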


    Finally, suppose Y = 1 is the target event and Y = 0 is the non-event.

    Then the decision threshold is simply this: for an arbitrarily fixed t in (0,1), Y^ = 1 if and only if P(Y = 1 | X) > t.
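
    As a small illustration of this rule, the sketch below applies a threshold t to predicted probabilities and computes precision and recall from the result. The function names and the six toy observations are made up for the example.

```python
def classify(probabilities, t):
    """Predict 1 when the estimated P(Y = 1 | X) is strictly greater than t."""
    return [1 if p > t else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the event class Y = 1."""
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example (assumption): estimated probabilities and true labels.
probabilities = [0.9, 0.7, 0.65, 0.4, 0.3, 0.1]
y_true = [1, 1, 0, 1, 0, 0]

y_pred = classify(probabilities, t=0.6)
precision, recall = precision_recall(y_true, y_pred)
print(y_pred)             # [1, 1, 1, 0, 0, 0]
print(precision, recall)  # 2 of 3 predicted events are real; 2 of 3 real events are found
```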

    Is that all clear?

    Thank you for your help,
    Marco
    Last edited by MarcoVA; 10-22-2017 at 07:45 AM.

  4. #4
    hlsmith

    Re: Variable importance in random forest

    Well, if you are now using a subset of variables, the precision, etc. will be different in the testing phase. What did the variable importance look like in the training phase?

  5. #5

    Re: Variable importance in random forest

    Yes, sure. But the problem is: how can the decrease in performance be so drastic? If you select the most important variables, how can performance fall this much?

    This is something I really cannot explain. A small decrease would be understandable, but I cannot see why the difference is as large as the one above.

    Do you have any idea? Could it be due to this particular algorithm?

    Thank you,
    Marco

  6. #6
    rogojel

    Re: Variable importance in random forest

    Hi,
    as a blind guess, I would say that your variables are not really capturing the effect. I mean, it looks like no variables stand out in terms of importance. I guess that if you trained on a different sample, you would get other variables in the top 20.
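
    One way to check this guess would be to fit the forest on two different training samples and measure how much the two top-k sets overlap. The sketch below only covers the comparison step. The function names and the importance scores are invented for illustration; in practice the two dictionaries would hold the importances from two fitted models.

```python
def top_k(importances, k):
    """Return the set of the k variable names with the largest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(imp_a, imp_b, k):
    """Jaccard overlap of the two top-k sets: 1.0 is identical, 0.0 is disjoint."""
    a, b = top_k(imp_a, k), top_k(imp_b, k)
    return len(a & b) / len(a | b)


# Invented importance scores from two hypothetical training samples.
imp_sample_1 = {"x1": 0.30, "x2": 0.25, "x3": 0.20, "x4": 0.15, "x5": 0.10}
imp_sample_2 = {"x1": 0.28, "x2": 0.12, "x3": 0.22, "x4": 0.14, "x5": 0.24}

overlap = top_k_overlap(imp_sample_1, imp_sample_2, k=3)
print(overlap)  # top 3 are {x1, x2, x3} and {x1, x3, x5}: 2 shared out of 4 distinct
```

    A low overlap across resamples would suggest that the ranking is mostly noise, so the selected "top 20" would not be a stable set.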

    regards

  7. #7
    hlsmith

    Re: Variable importance in random forest


    Did the outcome group or the outcome variable get switched?

    I agree that is a large discrepancy.

    I wouldn't rely on the results of the following exercise, but to rule out a coding issue, perhaps try gradually increasing the training sample size.
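
    A rough sketch of that check is shown below: refit on progressively larger training samples and watch whether precision and recall on a fixed test set move smoothly. It assumes scikit-learn and synthetic data from `make_classification`, neither of which comes from the thread, and the training sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 20 features, 10% positives.
X, y = make_classification(
    n_samples=6000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
# train_test_split shuffles, so the first n rows of the pool are a random sample.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

results = []
for n in (250, 500, 1000, 2000, 3000):
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(X_pool[:n], y_pool[:n])
    # Probability of class 1, thresholded at 0.6 as in the original question.
    pred = (forest.predict_proba(X_test)[:, 1] > 0.6).astype(int)
    results.append((
        n,
        precision_score(y_test, pred, zero_division=0),
        recall_score(y_test, pred, zero_division=0),
    ))

for n, precision, recall in results:
    print(f"n={n:5d}  precision={precision:.2f}  recall={recall:.2f}")
```

    Abrupt jumps or a collapse at some sample size would point to a problem in the data handling rather than in the model.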
