Variable importance using regression trees


schw0516
01-08-2008, 07:54 AM
I am analyzing carbon balance terms as a function of environmental drivers. I have one dependent variable and 5 independent variables. Using an optimally pruned tree, I wish to calculate the percent variation explained by each variable. I have seen this procedure in the literature. An example is: http://www.pnas.org/cgi/content/abstract/104/30/12259

I have corresponded with the authors of the PNAS paper and tried to reverse engineer things but am not having any luck. I generally use Matlab for all data processing. If anyone has any insight here it would be appreciated.

mp83
01-08-2008, 11:07 AM
If you want to use CART (Classification and Regression Trees), there are plenty of packages for it in R; see the CRAN task views:

> http://cran.r-project.org/src/contrib/Views/MachineLearning.html
> http://cran.r-project.org/src/contrib/Views/Multivariate.html

schw0516
01-08-2008, 11:19 AM
I've used R before but it does not calculate variable importance. As an example: if I have some optimally pruned tree, I want to use the performance measures at each node and overall to calculate how much variation was explained by X1, X2, etc., so that I get a % for each variable. R does not do that.

mp83
01-08-2008, 12:11 PM
Hm...You got me!

Priya
01-09-2008, 12:01 AM
Since you have dependent and independent variables, is it possible for you to use regression? In regression you will get the % variation explained by all 5 independent variables together as well as by each individual variable. See if that is useful for you.
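One way to read the regression suggestion is to split the total sum of squares of a linear model into sequential (Type I) pieces, one per predictor. The sketch below is only an illustration under assumptions: it uses R's built-in mtcars data rather than the carbon balance data, and the helper name ss_percent is hypothetical. Sequential sums of squares depend on the order in which predictors enter the model, so the shares are not unique when predictors are correlated.

```r
# Percent of total sum of squares attributed to each term, in sequential order.
# ss_percent is a hypothetical helper; mtcars stands in for the real data.
ss_percent <- function(formula, data) {
  tab <- anova(lm(formula, data = data))           # sequential (Type I) ANOVA table
  ss <- setNames(tab[["Sum Sq"]], rownames(tab))   # one entry per term, plus Residuals
  100 * ss / sum(ss)                               # shares of the total sum of squares
}

p_forward <- ss_percent(mpg ~ wt + hp + disp, mtcars)
p_reverse <- ss_percent(mpg ~ disp + hp + wt, mtcars)

print(round(p_forward, 1))
print(round(p_reverse, 1))  # same model, different order, different shares
```

The non-residual shares add up to 100 times the model R-squared in both cases; only the way that total is divided among the predictors changes with the order.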

Mike White
01-09-2008, 10:42 AM
In the publication you quoted, the authors use R 2.0.1 and JMPIN 5.2 for the statistical analyses and the classification and regression trees were fitted using the rpart package of R. I presume therefore that they calculated the percent variation for each variable from the rpart output. Maybe the authors would be able to provide details of how the output of R was used. I would certainly be interested to know if you can get further details.

schw0516
01-10-2008, 01:18 PM
I have corresponded with the authors. The regression trees were, as you say, fitted in R using rpart, and the percent variation explained by each variable is a function of the improve column in the rpart output. This is all I know, and I have not been able to reproduce the exact calculation even for a trivial example. If you have access to R try this:

With this R code:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit$splits)

The output is a table:
       count ncat    improve index       adj
Start     81    1 6.76232996   8.5 0.0000000
Number    81   -1 2.86679493   5.5 0.0000000
Age       81   -1 2.25021152  39.5 0.0000000
Number     0   -1 0.80246914   6.5 0.1578947
Start     62    1 1.02052786  14.5 0.0000000
Age       62   -1 0.68486352  55.0 0.0000000
Number    62   -1 0.29753321   4.5 0.0000000
Number     0   -1 0.64516129   3.5 0.2413793
Age        0   -1 0.59677419  16.0 0.1379310
Age       33   -1 1.24675325  55.0 0.0000000
Start     33    1 0.28877005  12.5 0.0000000
Number    33    1 0.17532468   3.5 0.0000000
Start      0   -1 0.75757576   9.5 0.3333333
Number     0    1 0.69696970   5.5 0.1666667
Age       21    1 1.71428571 111.0 0.0000000
Start     21    1 0.79365079  12.5 0.0000000
Number    21    1 0.07142857   3.5 0.0000000

To get percent variation explained for each variable I added up the matching improve values in the above table. This is what I gleaned from the paper and emails. For all 3 variables I get:

Age 6.49288819
Number 5.55568152
Start 9.62285442
Total 21.67142413

or 29.96%, 25.64%, and 44.40%, respectively.
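The sums and percentages above can be checked with a few lines of R. This is a minimal sketch that simply transcribes the improve column from the table in this post and pools every row by variable name, exactly as described; it is not claimed to be the calculation used in the paper.

```r
# 'improve' values transcribed from the fit$splits table in this post,
# in table order, named by the variable on each row.
improve <- c(
  Start = 6.76232996, Number = 2.86679493, Age = 2.25021152,
  Number = 0.80246914, Start = 1.02052786, Age = 0.68486352,
  Number = 0.29753321, Number = 0.64516129, Age = 0.59677419,
  Age = 1.24675325, Start = 0.28877005, Number = 0.17532468,
  Start = 0.75757576, Number = 0.69696970, Age = 1.71428571,
  Start = 0.79365079, Number = 0.07142857
)

totals <- tapply(improve, names(improve), sum)   # one sum per variable
percent <- round(100 * totals / sum(totals), 2)  # share of the grand total

print(totals)
print(percent)  # Age 29.96, Number 25.64, Start 44.40
```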

Again, this is the extent of the conversation so far. I might have upset the apple cart by asking to see the code used. But, in the end, that's somewhat moot. I merely want to reproduce the calculations, as rpart does not generate these % variation explained estimates automatically. If you are curious, have a look at DTreg (www.dtreg.com), which calculates variable importance scores, something similar but not quite what I want/need.

Mike White
01-12-2008, 01:02 PM
The attached code calculates the variable effectiveness, as you described, using the 'improve' values from the output of rpart. Hopefully this is what you want.
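The attachment itself is not included in this text. As a rough sketch of the same idea, and not the attached code, the function below sums the improve column of fit$splits by variable name for any rpart fit and converts the sums to percentages. The function name improve_percent is hypothetical, and the function pools every row of the splits table, as described earlier in the thread.

```r
library(rpart)

# Sum the 'improve' column of fit$splits by variable and express as percent.
# improve_percent is a hypothetical helper; it pools all rows of the table.
improve_percent <- function(fit) {
  splits <- fit$splits
  totals <- tapply(splits[, "improve"], rownames(splits), sum)
  100 * totals / sum(totals)
}

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
print(round(improve_percent(fit), 2))
```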

schw0516
01-28-2008, 01:09 PM
Thanks Mike, I had recreated something similar myself so this check is helpful. It's just that the more I dig into this topic, the more I get the sense that the procedure is a really bad idea. I mean, you can do it, and it will give you numbers, but it has some serious issues. I wish I could say more, but I'm still learning and am unsure whether there is a simple, statistically robust way to partition the variation explained (in a regression sense) among the predictor variables used.