schw0516
01-08-2008, 07:54 AM
I am analyzing carbon balance terms as a function of environmental drivers. I have one dependent variable and 5 independent variables. Using an optimally pruned tree, I want to calculate the percent variation explained by each variable. I have seen this procedure in the literature. An example is: http://www.pnas.org/cgi/content/abstract/104/30/12259
I have corresponded with the authors of the PNAS paper and tried to reverse engineer things but am not having any luck. I generally use Matlab for all data processing. If anyone has any insight here it would be appreciated.
If you want to use CART (Classification and Regression Trees), there are many packages for it in R:
> http://cran.r-project.org/src/contrib/Views/MachineLearning.html
> http://cran.r-project.org/src/contrib/Views/Multivariate.html
schw0516
01-08-2008, 11:19 AM
I've used R before, but it does not calculate variable importance. As an example: if I have some optimally pruned tree, I want to use the performance measures at each node and overall to calculate how much variation was explained by X1, X2, etc., so that I get a % for each variable. R does not do that.
Priya
01-09-2008, 12:01 AM
Since you have dependent and independent variables, is it possible for you to use regression? With regression you will get the % variation explained by all 5 independent variables together as well as by each individual variable. See if that is useful for you.
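A minimal sketch of what the regression suggestion might look like in R. The built-in mtcars data and the predictors wt, hp and disp are stand-ins chosen for illustration, not the carbon-balance data discussed here. The sketch uses the sequential (Type I) sums of squares from anova(), so when predictors are correlated the share credited to each one depends on the order in which the terms enter the model.

```r
# Stand-in data: mtcars ships with base R; these predictors are illustrative only.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Sequential (Type I) sums of squares, one row per term plus Residuals.
tab <- anova(fit)
ss <- setNames(tab[["Sum Sq"]], rownames(tab))

# Share of the total sum of squares attributed to each term, in percent.
pct <- 100 * ss / sum(ss)
print(round(pct, 2))

# The non-residual shares add up to the model's overall R-squared (in percent).
r2_pct <- 100 * summary(fit)$r.squared
print(round(r2_pct, 2))
```

Reordering the terms in the formula and rerunning is a quick way to see how much the per-variable shares move.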
Mike White
01-09-2008, 10:42 AM
In the publication you quoted, the authors use R 2.0.1 and JMPIN 5.2 for the statistical analyses and the classification and regression trees were fitted using the rpart package of R. I presume therefore that they calculated the percent variation for each variable from the rpart output. Maybe the authors would be able to provide details of how the output of R was used. I would certainly be interested to know if you can get further details.
schw0516
01-10-2008, 01:18 PM
I have corresponded with the authors. The regression trees were, as you say, done in R using rpart, and the percent variation explained is a function of the improve column in the rpart output. This is all I know, and I have not been able to reproduce the exact calculation even for a trivial example. If you have access to R, try this code:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit$splits)
The output is a table:
       count ncat    improve index       adj
Start     81    1 6.76232996   8.5 0.0000000
Number    81   -1 2.86679493   5.5 0.0000000
Age       81   -1 2.25021152  39.5 0.0000000
Number     0   -1 0.80246914   6.5 0.1578947
Start     62    1 1.02052786  14.5 0.0000000
Age       62   -1 0.68486352  55.0 0.0000000
Number    62   -1 0.29753321   4.5 0.0000000
Number     0   -1 0.64516129   3.5 0.2413793
Age        0   -1 0.59677419  16.0 0.1379310
Age       33   -1 1.24675325  55.0 0.0000000
Start     33    1 0.28877005  12.5 0.0000000
Number    33    1 0.17532468   3.5 0.0000000
Start      0   -1 0.75757576   9.5 0.3333333
Number     0    1 0.69696970   5.5 0.1666667
Age       21    1 1.71428571 111.0 0.0000000
Start     21    1 0.79365079  12.5 0.0000000
Number    21    1 0.07142857   3.5 0.0000000
To get percent variation explained for each variable I added up the matching improve values in the above table. This is what I gleaned from the paper and emails. For all 3 variables I get:
Age 6.49288819
Number 5.55568152
Start 9.62285442
Total 21.67142413
or 29.96%, 25.64%, and 44.40% for Age, Number, and Start, respectively.
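The arithmetic above can be checked with a short sketch in base R. The variable names and improve values are typed in by hand from the posted table, and the object names (splits, totals, percent) are made up for this example. It sums every row for a variable, exactly as described in the post.

```r
# Variable name and 'improve' value for each row of the posted fit$splits table.
splits <- data.frame(
  variable = c("Start", "Number", "Age", "Number",
               "Start", "Age", "Number", "Number", "Age",
               "Age", "Start", "Number", "Start", "Number",
               "Age", "Start", "Number"),
  improve = c(6.76232996, 2.86679493, 2.25021152, 0.80246914,
              1.02052786, 0.68486352, 0.29753321, 0.64516129, 0.59677419,
              1.24675325, 0.28877005, 0.17532468, 0.75757576, 0.69696970,
              1.71428571, 0.79365079, 0.07142857)
)

# Sum 'improve' over every row that names the variable.
totals <- tapply(splits$improve, splits$variable, sum)

# Express each total as a percentage of the grand total.
percent <- 100 * totals / sum(totals)

print(round(totals, 8))
print(round(percent, 2))
```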
Again, this is the extent of the conversation so far. I might have upset the apple cart by asking to see the code used. But, in the end, that's somewhat moot. I merely want to reproduce the calculations, as rpart does not generate these % variation explained estimates automatically. If you are curious, have a look at DTreg (www.dtreg.com), which calculates variable importance scores, something similar but not quite what I want/need.
Mike White
01-12-2008, 01:02 PM
The attached code calculates the variable effectiveness, as you described, using the 'improve' values from the output of rpart. Hopefully this is what you want.
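The attachment is not reproduced in this thread. The sketch below is not that code but a variant of the same idea, worth comparing against the all-rows sum. As far as the rpart object documentation describes it, fit$splits lists a primary split for each internal node, followed by its competitor splits and then its surrogate splits, and for surrogate rows the improve column holds agreement with the primary split rather than an improvement. This variant therefore sums improve over the primary splits only. The helper name primary_improve is made up for this example, and the row layout is an assumption to verify against your rpart version.

```r
library(rpart)

# Hypothetical helper: sum 'improve' per variable over primary splits only.
primary_improve <- function(fit) {
  ff <- fit$frame
  is_leaf <- ff$var == "<leaf>"
  # Each internal node owns one primary row, then its competitor rows,
  # then its surrogate rows, in the same order as the rows of fit$frame.
  block <- ff$ncompete + ff$nsurrogate + !is_leaf
  first_row <- cumsum(c(1, block))[c(!is_leaf, FALSE)]
  improve <- fit$splits[first_row, "improve"]
  tapply(improve, rownames(fit$splits)[first_row], sum)
}

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
totals <- primary_improve(fit)
percent <- 100 * totals / sum(totals)
print(round(percent, 2))
```

A variable that never appears as a primary split gets no entry here, whereas the all-rows sum credits it through competitor and surrogate rows, so the two approaches can give quite different percentages.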
schw0516
01-28-2008, 01:09 PM
Thanks Mike, I had recreated something similar myself, so this check is helpful. It's just that the more I dig into this topic, the more I think the procedure is a really bad idea. I mean, you can do it, and it will give you numbers, but it has some serious issues. I wish I could say more, but I'm still learning and am unsure whether there is a simple, statistically robust way to partition the variation explained (in a regression sense) among the predictor variables used.