I have a dataset with a binary outcome. I fitted a two-split classification tree to these data. Both splitting variables were natural binary variables. I then fit a logistic regression model to the dataset based on the tree, including the two binary variables and their interaction term. All of this made sense to me, and both approaches (classification tree and logistic regression) generated the same ROC curve, which had 3 bends.
Model 1: y = X1 + X2 + X1*X2
Next I collapsed the two predictors into a single variable and the ROC curve naturally changed:
If X1 = 1 then new_var = 1;
If X1 = 0 and X2 = 1 then new_var = 1;
If new_var NE 1 then new_var = 0;
Model 2: y = new_var
The new ROC curve has one bend. My question stems from both models having the same overall classification rule, yet each has a different curve and accuracy value. The model with more terms seems more accurate and so seems like the one to classify with, but the compound rule in the other model gets at the same patients. Which should be used, and does anyone have any insights to put my mind at ease?
Figure: "interaction" is the logistic model with the two main terms and their interaction; the "single term" model collapses the rules into a single binary predictor.
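To make the comparison concrete, here is a minimal Python sketch of the two setups. The cell counts are hypothetical (made up for illustration, not my data), and the function name roc_points is my own. It relies on the fact that a logistic model with both main effects and the interaction is saturated for two binary predictors, so its predicted probabilities equal the observed event rates in the four (X1, X2) cells. Four distinct scores give three interior ROC vertices; the collapsed variable has two distinct scores and so one.

```python
from collections import defaultdict

# Hypothetical counts per (X1, X2) cell: (events, non_events). Not real data.
cells = {
    (0, 0): (5, 95),
    (0, 1): (20, 80),
    (1, 0): (40, 60),
    (1, 1): (70, 30),
}

def roc_points(groups):
    """groups: list of (events, non_events), one entry per distinct score.
    Returns ROC vertices (FPR, TPR), from the strictest threshold to the loosest."""
    total_pos = sum(e for e, _ in groups)
    total_neg = sum(n for _, n in groups)
    # A group's score is its observed event rate; rank highest first.
    ordered = sorted(groups, key=lambda g: g[0] / (g[0] + g[1]), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for e, n in ordered:
        tp += e
        fp += n
        points.append((fp / total_neg, tp / total_pos))
    return points

# Model 1 (saturated): one score per cell, so four distinct scores.
model1 = roc_points(list(cells.values()))

# Model 2: new_var = 1 if X1 = 1 or X2 = 1, else 0, so two distinct scores.
merged = defaultdict(lambda: [0, 0])
for (x1, x2), (e, n) in cells.items():
    key = int(x1 == 1 or x2 == 1)
    merged[key][0] += e
    merged[key][1] += n
model2 = roc_points([tuple(v) for v in merged.values()])

print("Model 1 vertices:", model1)
print("Model 2 vertices:", model2)
```

With these made-up counts, where the X1 = 0, X2 = 0 cell has the lowest event rate, the single bend of Model 2 falls on one of the three bends of Model 1. So at that particular cutoff the two models flag exactly the same patients, and the curves differ only because Model 1 offers additional cutoffs.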