Hi,
I am analyzing some pretty hopeless datasets where the link between the DVs and IVs is quite weak. I observed, however, that when I take the two groups resulting from the first partition in the tree, I can generally get a nicely low p-value with a t-test. Is this some property of trees, I wonder? Is there any theorem pointing in this direction, or is this possibly a weak signal I am detecting?
regards
I don't quite understand what you're asking.
Is the program kicking out p-values at your splits (partitions), and are you finding that in a certain subgroup the outcomes differ significantly between groups at the second level?
A pictorial example would be great. Is this coming from a single decision tree that you have run a couple of times?
Hi,
yes, it is a single tree and a continuous DV. Imagine that I have the first partition, giving two subsets: one where the partition condition is TRUE (e.g. Volume > 5) and one where the condition is FALSE. If I take the two subsets and do a t-test on them like
t.test(dataset[condition,]$dv, dataset[!condition,]$dv)
I always get a low p-value (< 0.05). My question is whether this is to be expected, as sort of the normal behavior of partitions, or whether it is something one might consider a signal?
Now that I think of it, it looks like a case of multiple comparisons.
Regards
Well, typically the tree wouldn't fit the partition if it didn't actually do anything. The split is also fit pretty much so that you get the maximal difference between the two groups (that is, in essence, what the tree is attempting to do...). Now, how this compares to an actual linear regression depends on the data itself.
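That "maximal difference" point can be checked with a quick simulation. This is a minimal Python sketch (the thread's code is R, but the idea carries over); `welch_p`, `rate`, and the cut-scanning loop are my own stand-ins for a real tree's variance-reduction criterion, not any package's implementation. The DV is pure noise, yet the cut chosen to make the groups look most different still "passes" a t-test far more often than 5% of the time:

```python
import math
import random

def welch_p(a, b):
    """Two-sided Welch t-test p-value (normal approximation; fine for n ~ 100+)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    t = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

random.seed(1)
reps, n, hits = 100, 200, 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]  # DV is pure noise: no link to x
    # mimic the first split: scan candidate cuts, keep the most "significant" one
    best_p = 1.0
    for cut in sorted(x)[10:-10:5]:  # keep at least ~10 points on each side
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        best_p = min(best_p, welch_p(left, right))
    hits += best_p < 0.05
rate = hits / reps
print(rate)  # well above the nominal 5%, purely from selecting the best cut
```

Because the cut was chosen to maximize the group difference, the t-test on the resulting subsets is no longer a valid test of that difference; this is exactly the multiple-comparisons issue rogojel suspects.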
I wonder if there could be a theorem (or exercise) behind this, something like: if the number of data points is large enough, and the points are different enough, then the first partition will result in two groups which are significantly different?
Yes, absolutely there should be a reason behind the split. Usually it is something like an entropy or Gini index. What package and procedure are you using? It is like Dason said: it is looking for the split that maximizes the difference between the two groups.
The process or measure is called purity, and this short article describes the three main algorithms (i.e., Gini, entropy, and accuracy).
http://people.revoledu.com/kardi/tut...e-impurity.htm
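To make those impurity measures concrete, here is a minimal Python sketch (function names `gini` and `entropy` are my own; the formulas are the standard ones the article covers). Both measures are 0 for a pure node and largest for an even class mix, and the tree picks the split that most reduces them:

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy impurity: -sum(p_k * log2(p_k)); empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

print(gini([10, 0]))    # pure node -> 0.0
print(gini([5, 5]))     # 50/50 mix -> 0.5
print(entropy([5, 5]))  # 50/50 mix -> 1.0
```

Note these are classification criteria; for a continuous DV like rogojel's, the analogous split criterion is reduction in within-node variance, which is what makes the two resulting groups' means as far apart as possible.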