All,
Problem:
I need help understanding the probability scores produced by a decision tree model. Specifically, I'm using the gbm package in R to build Generalized Boosted Regression Models, but the behavior I see is common across various ensemble classification models.
My work has me building models on training data with varying degrees of class imbalance. In every case I'm dealing with two classes (e.g., Yes/No). I typically hold out 10% or more of the training data to score and test the model's validity.
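For reference, the scores I'm describing come from a workflow roughly like the sketch below; the data frame and column names (`train`, `holdout`, `response`) are placeholders, not my actual data.

```r
library(gbm)

# Placeholder data: `train` holds the training records with a 0/1 column
# `response` (1 = Yes); `holdout` is the ~10% set aside for testing.
fit <- gbm(response ~ ., data = train,
           distribution = "bernoulli",    # two-class (Yes/No) outcome
           n.trees = 1000,
           interaction.depth = 3,
           shrinkage = 0.01)

# type = "response" puts the scores on the 0-1 probability scale
holdout_score <- predict(fit, newdata = holdout,
                         n.trees = 1000, type = "response")
```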
Findings:
The closer the training data is to a 50/50 distribution (example: 800 Yes records and 800 No records), the higher the probability scores assigned to my test data. In fact, the scores typically range from about 0.01 to 0.99. This is what I would expect and what I would like to see from every model.
However, as the distribution becomes more unbalanced (fewer "Yes" records), say 40%/60% or even 10%/90%, the probability scores have a narrower range and the maximum score is sometimes below 0.5.
As a general rule, if a scored record has a score of 0.5 or greater, it is predicted to be in the "Yes" class. However, this rule becomes irrelevant if every scored record falls below 0.5.
The point of building a model and scoring an unknown universe is ultimately to RANK the records from most likely to least likely to belong to the "Yes" class. Even in the cases where all the scores are below 0.5, the model still does a good job of ranking the records. Is the generated probability score therefore irrelevant, and should I simply re-scale it?
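To make the ranking point concrete: a rank-based measure such as AUC depends only on the ordering of the scores, so any monotone re-scaling leaves it unchanged. A minimal base-R sketch, assuming hypothetical holdout vectors `holdout_score` and `holdout_label` (1 = Yes, 0 = No):

```r
# AUC = probability that a randomly chosen Yes record outscores a randomly
# chosen No record, computed here from ranks (Mann-Whitney form), so it is
# unaffected by any monotone transformation of the scores.
auc_from_scores <- function(score, label) {
  r     <- rank(score)                 # ranks of all scores, ties averaged
  n_yes <- sum(label == 1)
  n_no  <- sum(label == 0)
  (sum(r[label == 1]) - n_yes * (n_yes + 1) / 2) / (n_yes * n_no)
}

auc_from_scores(holdout_score, holdout_label)
# identical value after a monotone re-scaling, e.g. the log-odds transform:
auc_from_scores(qlogis(holdout_score), holdout_label)
```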
I have two main issues:
Issue #1 - Are these probability scores meaningful as true probabilities? That is, if a model produces zero test records with scores greater than 0.5, is the model poor? Is it common practice simply to re-scale the scores from every model so that they span 0-1?
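To illustrate what I mean by "re-scale": a min-max stretch forces the scores to span 0-1 but does not make them calibrated probabilities, whereas a Platt-style calibration (logistic regression of the holdout outcome on the raw score) at least targets true probabilities. A rough sketch on the same hypothetical `holdout_score` / `holdout_label` vectors:

```r
# (a) Min-max re-scaling: preserves the ranking and spans 0-1,
#     but the result is not a calibrated probability.
rescale_01 <- function(s) (s - min(s)) / (max(s) - min(s))
stretched  <- rescale_01(holdout_score)

# (b) Platt-style calibration: regress the actual holdout outcome on the
#     raw score and use the fitted probabilities instead.
calib      <- glm(holdout_label ~ holdout_score, family = binomial)
calibrated <- predict(calib, type = "response")
```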
Issue #2 - If I re-balance my classes to 50/50 in the training file, I sometimes get far more predicted "Yes" records than should be expected. For example, a mailing campaign expects a 1-5% response rate, but when I build a model on equal classes (50/50), scoring the universe suggests 20-25% (sometimes even more) expected respondents. When is it appropriate to balance the training classes, and when should I leave them unchanged and unbalanced?
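My understanding is that if the training file is artificially balanced, the scores can be mapped back to the real-world base rate with a standard prior-shift correction (this changes the scale of the scores but not their ranking). A sketch, where `r_train` is the Yes-rate in the training file and `r_true` is the expected rate in the population, both hypothetical inputs:

```r
# Prior-shift correction: re-weight each score by the ratio of the true class
# priors to the training-file priors (a standard post-hoc adjustment).
adjust_prior <- function(p, r_train, r_true) {
  w_yes <- r_true / r_train
  w_no  <- (1 - r_true) / (1 - r_train)
  (p * w_yes) / (p * w_yes + (1 - p) * w_no)
}

# e.g. scores from a 50/50 training file, mapped to a 3% expected response rate:
# adjusted <- adjust_prior(raw_scores, r_train = 0.50, r_true = 0.03)
```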
My test results from 3 models built from the same training data set. The only difference is the class distribution, noted in parentheses.
Model 1: (50/50)
- No records = 1,200
- Yes records = 1,200
- Max score = 0.95
- Median score = 0.48
Model 2: (80/20)
- No records = 1,200
- Yes records = 300
- Max score = 0.89
- Median score = 0.16
Model 3: (5/95)
- No records = 1,200
- Yes records = 64
- Max score = 0.43
- Median score = 0.039