I have a dataset with a binary outcome. I fit a classification tree with two splits; both splitting variables were naturally binary. I then fit a logistic regression model to the dataset based on the tree, so I included the two binary variables and their interaction term. All of that made sense to me, and both approaches generated the same ROC curve, with 3 bends.
Model 1: y = X1 + X2 + X1*X2
Next I collapsed the two predictors into a single variable and the ROC curve naturally changed:
If X1 = 1 then new_var = 1;
Else if X2 = 1 then new_var = 1;
Else new_var = 0;
Model 2: y = new_var
The new ROC curve now has only one bend. My question stems from the fact that both models implement the same overall classification rule, yet they produce different curves and different accuracy values. The model with more terms appears more accurate and seems like the one to classify with, but the compound rule in the other model flags the same patients. Which should be used, and does anyone have any insight to put my mind at ease?
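One way to see why the curves differ: with two binary predictors, Model 1 (main effects plus interaction) is saturated, so its fitted probabilities equal the observed event rate in each of the 4 cells, giving up to 4 distinct scores and hence 3 bends. Model 2 assigns only 2 distinct scores (one per value of new_var), hence 1 bend. A minimal Python sketch with hypothetical cell counts (the event rates below are made up for illustration):

```python
# Hypothetical data: (X1, X2) cell -> (n_events, n_total).
# The per-cell event rates differ, so the saturated model
# assigns a distinct score to each cell.
cells = {
    (0, 0): (5, 100),
    (0, 1): (30, 100),
    (1, 0): (50, 100),
    (1, 1): (80, 100),
}

# Model 1 (X1 + X2 + X1*X2) is saturated: the fitted probability
# in each cell equals that cell's observed event rate.
score_model1 = {cell: e / n for cell, (e, n) in cells.items()}

# Model 2 scores on the collapsed rule new_var = (X1 = 1 or X2 = 1).
# The three "positive" cells pool into one fitted probability and the
# remaining cell into another, so only 2 distinct scores survive.
pos = [(e, n) for (x1, x2), (e, n) in cells.items() if x1 == 1 or x2 == 1]
neg = [(e, n) for (x1, x2), (e, n) in cells.items() if x1 == 0 and x2 == 0]
p_pos = sum(e for e, _ in pos) / sum(n for _, n in pos)
p_neg = sum(e for e, _ in neg) / sum(n for _, n in neg)

print(len(set(score_model1.values())))  # 4 distinct scores -> ROC with 3 bends
print(len({p_pos, p_neg}))              # 2 distinct scores -> ROC with 1 bend
```

Both models classify the same patients as positive at the natural cutoff, but the ROC curve sweeps over all thresholds, and Model 1 simply has more thresholds to sweep, which is why it can trace a higher curve.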
Figure: "interaction" is the logistic model with the two main terms and their interaction; "single term" is the model that collapses the rule into a single predictor.