# Thread: p-values in regression trees

1. ## p-values in regression trees

Hi,
I am analyzing some pretty hopeless datasets where the link between the DVs and IVs is quite weak. However, I have observed that when I take the two groups resulting from the first partition in the tree, I can generally get a nicely low p-value with a t-test. Is this some property of trees, I wonder? Is there a theorem pointing in this direction, or am I possibly detecting a weak signal?

regards

2. ## Re: p-values in regression trees

I don't quite understand what you're asking.

3. ## Re: p-values in regression trees

Is the program kicking out p-values at your splits (partitions), and are you finding that in a certain subgroup the outcomes differ significantly between the groups at the second level?

A pictorial example would be great. Is this coming from a single decision tree that you have run a couple of times?

4. ## Re: p-values in regression trees

Hi,
Yes, it is a single tree and a continuous DV. Imagine that I have the first partition, so I have two subsets: one where the partition condition is TRUE (e.g. Volume > 5) and one where the condition is FALSE. If I take the two subsets and run a t-test on them like

t.test(dataset[condition,]$dv, dataset[!condition,]$dv)

I always get a low p-value (<0.05). My question is whether this is to be expected as the normal behavior of partitions, or whether it is something one might consider a signal.

Now that I think of it, it looks like a case of multiple comparisons.
Regards

5. ## Re: p-values in regression trees

Well, typically the tree wouldn't make the partition if it didn't actually do anything. The split is also chosen pretty much so that you get the maximal difference between the two groups (that is in essence what the tree is attempting to do...), so a t-test on those same two groups will tend to look significant whether or not there is a real signal. Now, the question of how it compares to an actual linear regression depends on the data itself.
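
To make the point about the maximal difference concrete, here is a minimal sketch in base R under stated assumptions. The data are simulated noise, the names `iv`, `dv`, and `best_split`, the sample size, and the minimum group size are all made up for illustration, and the split search is a hand-rolled stand-in for what a tree package does at its first node (it is not any particular package's implementation). It picks the threshold that minimises the within-group sum of squares and then runs the naive t-test on the resulting groups.

```r
# Sketch: first split of a regression tree on pure noise (illustrative only).
set.seed(1)                      # arbitrary seed, for reproducibility only
n  <- 200                        # hypothetical sample size
iv <- runif(n, 0, 10)            # hypothetical predictor (think "Volume")
dv <- rnorm(n)                   # DV generated independently of iv: no signal

# Exhaustive search for the threshold that minimises the within-group
# sum of squares, a common splitting criterion for a continuous DV.
best_split <- function(x, y, min_size = 10) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  sse  <- sapply(cuts, function(thr) {
    left  <- y[x <= thr]
    right <- y[x > thr]
    # Skip thresholds that leave one group too small (assumed minimum size).
    if (length(left) < min_size || length(right) < min_size) return(Inf)
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

threshold <- best_split(iv, dv)
condition <- iv > threshold
naive_p   <- t.test(dv[condition], dv[!condition])$p.value

# Repeat on fresh noise to see how often the naive test "finds" something.
naive_ps <- replicate(200, {
  x   <- runif(n, 0, 10)
  y   <- rnorm(n)
  grp <- x > best_split(x, y)
  t.test(y[grp], y[!grp])$p.value
})
rejection_rate <- mean(naive_ps < 0.05)   # nominal rate would be 0.05
```

In a setup like this, `rejection_rate` should come out well above the nominal 0.05 even though the data contain no signal. The threshold was selected from many candidates precisely because it separates the groups best, which is the multiple-comparisons effect mentioned earlier in the thread.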

6. ## The Following User Says Thank You to Dason For This Useful Post:

rogojel (12-19-2016)

7. ## Re: p-values in regression trees

I wonder if there could be a theorem (or exercise) behind this, something like: if the number of data points is large enough and the points are different enough, then the first partition will result in two groups that are significantly different?
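
Short of a theorem, one way to probe this empirically is a permutation test that repeats the split search on shuffled data, so that the selection step is included in the reference distribution. The following is a sketch only: the data are simulated noise standing in for a real dataset, and `max_abs_t`, the number of permutations, and the minimum group size are hypothetical choices, not something from the thread.

```r
# Sketch: permutation p-value for the best split (illustrative assumptions).
set.seed(2)                      # arbitrary seed
n  <- 150                        # hypothetical sample size
iv <- runif(n, 0, 10)            # hypothetical predictor
dv <- rnorm(n)                   # placeholder DV with no real link to iv

# Largest absolute Welch t statistic over all admissible thresholds.
max_abs_t <- function(x, y, min_size = 10) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  stats <- sapply(cuts, function(thr) {
    a <- y[x > thr]
    b <- y[x <= thr]
    if (length(a) < min_size || length(b) < min_size) return(0)
    abs(mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
  })
  max(stats)
}

observed <- max_abs_t(iv, dv)

# Shuffling dv breaks any link to iv but keeps the split search in the loop,
# so each permuted statistic is also a "best of many" value.
permuted <- replicate(200, max_abs_t(iv, sample(dv)))

# Permutation p-value with the usual +1 correction.
perm_p <- (1 + sum(permuted >= observed)) / (1 + length(permuted))
```

If `perm_p` is small on real data, the first split is separating the groups better than the best split on shuffled data typically does, which is closer to evidence of a signal than the naive t-test is.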

8. ## Re: p-values in regression trees

Yes, absolutely there should be a reason behind the split. For classification trees it is usually something like entropy or the Gini index; with a continuous DV it is typically the reduction in the sum of squares. What package and procedure are you using? It is like Dason said: it is looking for the split that maximizes the difference between the two groups.

9. ## Re: p-values in regression trees

The measure is called purity (or impurity), and this short article describes the three main measures (i.e., Gini, entropy, and accuracy).

http://people.revoledu.com/kardi/tut...e-impurity.htm
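
For reference, the three measures can be written down in a few lines of R. This is a sketch using the standard textbook definitions rather than code from the linked article. The function names are made up for illustration, `class_error` is one minus the accuracy of predicting the majority class, and `node_sse` is included as the assumed counterpart for a continuous DV.

```r
# p: vector of class proportions in a node (non-negative, summing to 1).
gini        <- function(p) 1 - sum(p^2)
entropy     <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # in bits
class_error <- function(p) 1 - max(p)

# Assumed analogue for a continuous DV: within-node sum of squares.
node_sse <- function(y) sum((y - mean(y))^2)

gini(c(0.5, 0.5))         # 0.5, the most impure two-class node
entropy(c(0.5, 0.5))      # 1
class_error(c(0.5, 0.5))  # 0.5
gini(c(1, 0))             # 0, a pure node
node_sse(c(1, 2, 3))      # 2
```

A split is then scored by how much the size-weighted impurity of the two child nodes drops relative to the parent node.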
