problem with decision tree using rpart: "fit is not a tree, just a root"

#1
I'm doing very basic decision tree practice, but I'm having trouble getting my tree to output. I'm using the rpart function for this.

It's an analysis of 'large' auto accident losses (indicated by a 1 or 0), using several characteristics of the insurance policy, i.e. vehicle year, age, gender, marital status.

first, I do this:
fit <- (largeloss ~ variable1 + variable2 + variable3 + variable4, data=tree, method = "class").

that runs fine.

but when I try to plot(fit), I get the error in the title.

what does it mean and where do I start to correct it?
 

rogojel

TS Contributor
#3
You need to actually call the function rpart :))))
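
For reference, a minimal sketch of the corrected call. The data frame here is simulated purely for illustration (the column names follow the original post, the values are made up), and the guard around plot() is just one convenient way to handle the case discussed in this thread. plot() on an rpart object stops with "fit is not a tree, just a root" when the fitted tree has no splits, so checking the number of rows in fit$frame first avoids the error.

```r
library(rpart)

set.seed(1)
n <- 500

# Hypothetical stand-in for the poster's data; values are simulated.
tree <- data.frame(
  variable1 = sample(1995:2015, n, replace = TRUE),          # e.g. vehicle year
  variable2 = sample(18:80, n, replace = TRUE),              # e.g. driver age
  variable3 = factor(sample(c("M", "F"), n, replace = TRUE)),
  variable4 = factor(sample(c("M", "S"), n, replace = TRUE))
)
tree$largeloss <- factor(rbinom(n, 1, 0.13))                 # 1 = large loss

# The formula has to be passed to rpart(); a bare formula is not a model.
fit <- rpart(largeloss ~ variable1 + variable2 + variable3 + variable4,
             data = tree, method = "class")

# fit$frame has one row per node, so a single row means "just a root".
if (nrow(fit$frame) > 1) {
  plot(fit)
  text(fit, use.n = TRUE)
} else {
  print(fit)   # shows the single root node instead of failing in plot()
}
```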
 

rogojel

TS Contributor
#5
Type fit; this will output the tree structure, and it would be good to see it. My guess would be that the node is already homogenous and cannot be split further, e.g. maybe you only have 0s or only 1s in your data set?
 
#6
thanks!
1. Does homogenous essentially mean there is not a pattern with my existing dataset?

2. my data has a combination of 1's/0's, integers, and characters. The 1's/0's are the indicator I use for whether there is a "large" loss. I have the actual driver's age. I also have classifications such as gender and marital status (M/F, M/S, etc).

3. when I type fit, I get the following:
n= 3018

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 3018 383 0 (0.8730948 0.1269052) *

how do I interpret this result?
 

hlsmith

Less is more. Stay pure. Stay poor.
#7
Homogenous likely means the distribution of the outcome does not differ between the groups. The program notes can tell you exactly what its criteria are, but it is likely something similar to a chi-sq 2x2 test.
 

rogojel

TS Contributor
#8
thanks!
1. Does homogenous essentially mean there is not a pattern with my existing dataset?
Yes, it means that there is no variable and split that could produce more homogenous subgroups than the original group.

3. when I type fit, I get the following:
n= 3018

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 3018 383 0 (0.8730948 0.1269052) *

how do I interpret this result?
You have 3018 data points in this group, which figures :)
The loss value is 383, which is the number of wrongly predicted data points. The prediction for the node is a 0, so the wrongly predicted data points are the 1s (383/3018 = 0.1269). The two numbers in parentheses are the estimated class probabilities: about 0.87 for a 0 and about 0.13 for a 1.

regards
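
The arithmetic behind that root line can be checked directly. This is only a sketch that re-derives the printed numbers from n and loss as shown in the output above; the variable names are made up for the example.

```r
# Numbers copied from the printed root node: "1) root 3018 383 0 (...)".
n    <- 3018   # observations in the node
loss <- 383    # observations not in the predicted class (here: the 1s)

# With predicted class 0, the class probabilities are (P(0), P(1)).
yprob <- c(`0` = (n - loss) / n, `1` = loss / n)
print(round(yprob, 7))

# The root's training misclassification rate is simply loss / n.
root_error <- loss / n
print(root_error)
```

So a root-only tree here amounts to "always predict 0", which is wrong for the 383 large-loss records even on the training data.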
 
#9
Thanks.
what is significant about the # of data points? Is it too low? how many should I have?
So 383 is the number of data points where it predicted a 0 (no large loss), but the actual data point is a 1 (large loss)? how could this be the case? I'm still using this as my training dataset, and haven't done any actual testing yet, so I thought it was still 'learning' from it.

So in the end, what does this mean for my data? how do I get it so I can output a tree? more variables? fewer variables? group things together to get a bigger set (i.e. group age by age ranges)?
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
rogojel, it is saying homogenous, but isn't that the null hypothesis? I thought it was looking for heterogeneity so that it could make more splits: it would make a split and say the two nodes below are different, i.e. "heterogenous". It presents the node as homogenous because that is the null hypothesis, and a significant difference would have justified the split.
 

rogojel

TS Contributor
#11
hi,
off the top of my head, the algorithm is searching for the variable and the value that split the original group in such a way that E(G) > w1*E(G1) + w2*E(G2), where E is the entropy, G, G1 and G2 are the original group and the two split groups, and w1 and w2 are the fractions of G that fall into G1 and G2. I just mean that if the algorithm cannot find any subgroups that have a lower entropy, then the original group has the lowest possible entropy, i.e. it is the most homogenous.

regards
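
The impurity comparison described above can be sketched in a few lines of base R. The helper names (gini, entropy, split_gain) and the toy vector are invented for this example and are not part of rpart. As far as the rpart documentation goes, method = "class" uses the Gini index by default, and entropy is available through parms = list(split = "information"); both are shown here.

```r
# Impurity of a vector of class labels.
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                 # avoid 0 * log(0)
  -sum(p * log2(p))
}

# Impurity reduction of a binary split: parent minus weighted children.
split_gain <- function(y, left, impurity = gini) {
  w <- mean(left)               # fraction of the group sent to the left child
  impurity(y) - (w * impurity(y[left]) + (1 - w) * impurity(y[!left]))
}

# Toy group: eight 0s and two 1s, split into two halves of five.
y    <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
left <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

print(split_gain(y, left, gini))
print(split_gain(y, left, entropy))
```

A split is only worth making when this gain is large enough. In rpart that threshold is governed by the complexity parameter cp, so a root-only tree can mean "no split cleared the threshold" rather than "no split reduced impurity at all".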
 
#12
I just pulled another data set to see if I can create a tree, and I'm still getting the same "fit is not a tree, just a root" message.

This time, I tried a tree on people who don't pay their bills, based on factors such as credit score, age, whether they had previous insurance, etc., but I'm still getting the error.

Based on my experience, I know there should be some pattern and correlation, but it's not showing. I'm starting to think either my data is set up incorrectly, or something else is going on. I tried to run the titanic data set and that worked fine without issues.

fit <-rpart(cancel~ var1+var2 + var3 + var4 ,data = nsf, method = "class")


n= 19273

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 19273 4557 1 (0.2364448 0.7635552) *
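
One thing worth trying in this situation is a smaller complexity parameter. What follows is a hedged sketch on simulated data, not the poster's nsf data: the data frame nsf_demo, its columns and the value cp = 0.0005 are all assumptions chosen for illustration. rpart's default cp is 0.01, and with a skewed outcome a split often cannot reduce the misclassification count enough to clear it, even when the predictors carry some signal.

```r
library(rpart)

set.seed(7)
n <- 5000

# Simulated stand-in data with a weak but real signal and a skewed outcome.
nsf_demo <- data.frame(
  var1 = rnorm(n),
  var2 = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
p <- plogis(1.2 + 0.5 * nsf_demo$var1 + 0.4 * (nsf_demo$var2 == "A"))
nsf_demo$cancel <- factor(rbinom(n, 1, p))

# Default settings (cp = 0.01).
fit_default <- rpart(cancel ~ var1 + var2, data = nsf_demo, method = "class")

# Same model, but allowing splits with a much smaller improvement.
fit_loose <- rpart(cancel ~ var1 + var2, data = nsf_demo, method = "class",
                   control = rpart.control(cp = 0.0005))

# Number of nodes in each tree (1 means "just a root").
print(c(default = nrow(fit_default$frame), loose = nrow(fit_loose$frame)))

# The cp table shows how cross-validated error changes with tree size,
# which helps when choosing a cp value for pruning with prune().
printcp(fit_loose)
```

A tree grown this way can easily overfit, so the cross-validated error column (xerror) in the cp table is the thing to look at before trusting any of the extra splits.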
 

rogojel

TS Contributor
#17
hi,
sorry for the late answer, I had no connection for some time. Using your data I did an rpart model with
Code:
tr=rpart(CANCELINDICATOR~., data=dat[,1:7])
and got a normal tree :

> tr
n= 4548

node), split, n, deviance, yval
      * denotes terminal node

1) root 4548 805.4292 0.7700088
  2) PRIORMONTHS=Y 1482 352.3516 0.6106613 *
  3) PRIORMONTHS=N 3066 397.2580 0.8470320
    6) NEW_RENEW_INDICATOR=R 998 191.3026 0.7414830 *
    7) NEW_RENEW_INDICATOR=N 2068 189.4715 0.8979691 *
>
 
#18
thanks. and just to clarify I understand:
The "~." means I want to use all the variables, but "dat[,1:7]" restricts it to the first 7 columns of my data, right? Also, I notice that I get a single node when I use method = 'class'. I thought that since my CANCELINDICATOR is pretty much a yes/no indicator, I needed to specify the method.

without saying method = 'class', what is the result telling me?
 

rogojel

TS Contributor
#19
hi,
the answer is yes to the first question.

The second part is very interesting. Afaik, but I am not sure, the difference between method "class" and "anova" is only in how the homogeneity of the subgroups is calculated. Method "anova" will treat the DV as a continuous variable and calculate the variance or some similar measure, while method "class" considers the DV as discrete and calculates the entropy. I was not aware that this can lead to such huge differences.
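
To make the comparison concrete, here is a small sketch on simulated data; the data frame dat_demo and its columns are invented for the example. When the response is a numeric 0/1 column and no method is given, rpart fits a regression ("anova") tree, so yval is the mean of the indicator in each node (the proportion of 1s) and deviance is the sum of squares around that mean. For a 0/1 variable the root deviance equals n * p * (1 - p), which matches the output posted above.

```r
library(rpart)

# Check against the posted root line: "1) root 4548 805.4292 0.7700088".
n_root <- 4548
p_root <- 0.7700088
root_deviance <- n_root * p_root * (1 - p_root)
print(root_deviance)            # compare with the printed deviance 805.4292

# Simulated stand-in data with a 0/1 outcome and one categorical predictor.
set.seed(42)
n <- 2000
dat_demo <- data.frame(prior = factor(sample(c("Y", "N"), n, replace = TRUE)))
dat_demo$cancel <- rbinom(n, 1, ifelse(dat_demo$prior == "Y", 0.60, 0.85))

# Numeric response, no method given: rpart chooses "anova".
fit_anova <- rpart(cancel ~ prior, data = dat_demo)

# Same response as a factor with method = "class": a classification tree.
fit_class <- rpart(factor(cancel) ~ prior, data = dat_demo, method = "class")

print(c(fit_anova$method, fit_class$method))
print(fit_anova)
print(fit_class)
```

In the regression tree a split that moves the node means from, say, 0.61 to 0.85 reduces the sum of squares noticeably. In the classification tree the same split leaves the predicted class at 1 on both sides, so the misclassification count does not improve and the split may not survive the cp threshold.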