p-values in regression trees

rogojel

TS Contributor
#1
Hi,
I am analyzing some pretty hopeless datasets where the link between the DVs and IVs is quite weak. I have noticed, however, that when I take the two groups produced by the first partition in the tree, I can generally get a nicely low p-value with a t-test. I wonder whether this is some property of trees. Is there a theorem pointing in this direction, or am I possibly detecting a weak signal?

regards
 

hlsmith

Omega Contributor
#3
Is the program kicking out p-values at your splits (partitions), and are you finding that, within a certain subgroup, the outcome differs significantly between groups at the second level?


A pictorial example would be great. Is this coming from a single decision tree that you have run a couple of times?
 

rogojel

TS Contributor
#4
Hi,
yes, it is a single tree and a continuous DV. Imagine that I have the first partition, and I have two subsets, one where the partition condition is TRUE (e.g. Volume > 5) and one where the condition is FALSE. If I run a t-test comparing the two subsets, like

t.test(dataset[condition,]$dv, dataset[!condition,]$dv)

I always get a low p-value (<0.05). My question is whether this is to be expected as the normal behavior of partitions, or whether it is something one might consider a signal.

Now that I think of it, it looks like a case of multiple comparisons.
Regards
 

Dason

Ambassador to the humans
#5
Well, typically the tree wouldn't make the partition if it didn't actually do anything. The split is also chosen pretty much so that you get the maximal difference between the two groups (that is in essence what the tree is attempting to do...), so a t-test run on those same two groups is biased toward a low p-value. Now, the question of how it compares to an actual linear regression depends on the data itself.
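
The point about the split being chosen to maximize the difference can be checked with a small simulation. This is only a sketch under stated assumptions: the data are made up (pure noise, one predictor), and the hypothetical helper `best_split_p` simply picks the cut with the smallest t-test p-value as a stand-in for whatever criterion a real tree package uses.

```r
# Sketch: how often does the "best" split look significant when the DV is pure noise?
set.seed(1)

# Smallest t-test p-value over all candidate cut points of one predictor x.
# min_size keeps both groups large enough for a t-test.
# (Hypothetical helper, not part of any tree package.)
best_split_p <- function(x, y, min_size = 5) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between sorted values
  p <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    if (length(left) < min_size || length(right) < min_size) return(NA_real_)
    t.test(left, right)$p.value
  })
  min(p, na.rm = TRUE)
}

n <- 100       # observations per simulated dataset
n_sim <- 200   # number of simulated datasets

# Cut point searched for in the data, as a tree would do.
p_best <- replicate(n_sim, {
  x <- rnorm(n)
  y <- rnorm(n)   # y is independent of x: there is no signal at all
  best_split_p(x, y)
})

# Cut point fixed in advance (x <= 0), for comparison.
p_fixed <- replicate(n_sim, {
  x <- rnorm(n)
  y <- rnorm(n)
  t.test(y[x <= 0], y[x > 0])$p.value
})

mean(p_best < 0.05)    # share of noise datasets whose best split looks "significant"
mean(p_fixed < 0.05)   # prespecified cut: should sit near the nominal 0.05
```

With the cut searched for in the data, the first share should come out well above 0.05 even though there is nothing to find, while the prespecified cut should stay near the nominal level. A real tree also searches over several predictors, which would presumably inflate the rate further.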
 

rogojel

TS Contributor
#6
I wonder if there could be a theorem (exercise) behind this, something like: if the number of data points is large enough and the points are different enough, then the first partition will result in two groups that are significantly different?
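
Short of a theorem, one way to judge whether a first split reflects a signal is a permutation test that repeats the split search on shuffled data. The following is a rough sketch, not a definitive procedure: the data are made up (a weak step at `x > 0.5`), and `max_abs_t` is a hypothetical helper that uses the largest absolute t statistic over all cuts as the split criterion.

```r
set.seed(2)

# Largest absolute t statistic over all candidate cut points of x.
# (Hypothetical helper; a real tree package may use a different criterion.)
max_abs_t <- function(x, y, min_size = 5) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between sorted values
  tv <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    if (length(left) < min_size || length(right) < min_size) return(NA_real_)
    abs(unname(t.test(left, right)$statistic))
  })
  max(tv, na.rm = TRUE)
}

# Made-up data with a weak step in the mean of y at x > 0.5.
n <- 100
x <- runif(n)
y <- 0.4 * (x > 0.5) + rnorm(n)

observed <- max_abs_t(x, y)

# Null distribution: shuffling y breaks any link to x,
# but the search over cut points is repeated each time.
null_stats <- replicate(200, max_abs_t(x, sample(y)))

# Permutation p-value that accounts for the search over cuts.
p_perm <- (1 + sum(null_stats >= observed)) / (1 + length(null_stats))
p_perm
```

Because the shuffled datasets go through the same search, `p_perm` is compared against the right reference distribution, unlike the plain t-test p-value on the chosen groups.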
 

hlsmith

Omega Contributor
#7
Yes, absolutely there should be a reason behind the split. Usually the criterion is something like entropy or the Gini index for a categorical outcome, or the reduction in the sum of squares for a continuous DV like yours. What package and procedure are you using? It is like Dason said: it is looking for the split that maximizes the difference between the two groups.
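
To make the idea of a split criterion concrete for a continuous DV, here is a minimal sketch that assumes a CART-style rule (pick the cut that minimizes the within-group sum of squares). The thread does not say which package is in use, so this is an illustration only; the toy data and the helper `sse_for_cut` are made up.

```r
# Within-group sum of squares for a candidate cut on x.
# (Hypothetical helper for illustration.)
sse_for_cut <- function(x, y, cut) {
  left <- y[x <= cut]
  right <- y[x > cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Toy data (made up): the mean of y jumps between x = 3 and x = 4.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.0, 1.2, 0.8, 3.1, 2.9, 3.0)

# Candidate cuts are the midpoints between consecutive sorted x values.
xs <- sort(unique(x))
cuts <- (head(xs, -1) + tail(xs, -1)) / 2

# Evaluate every candidate and keep the one with the smallest SSE.
sse <- sapply(cuts, function(cut) sse_for_cut(x, y, cut))
best_cut <- cuts[which.min(sse)]
best_cut
```

Since the total sum of squares is fixed, minimizing the within-group part is the same as maximizing the between-group part, which is why the chosen groups tend to look as different as the data allow.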