Bagging predictions with binary response variable in R

randomcat

New Member
I am trying to use the bagging technique to increase my model's predictive power. My response variable, status, is binary: 0 indicates no disease and 1 indicates disease. It is stored as a plain numeric vector of 0s and 1s (so its class is 'numeric', not 'factor'). I'm not sure if that's relevant, but I wanted to point it out.

Code:
library(ipred)
mod=bagging(status~x1+x2+x3+x4, data=fit26data, method="class")
pred=predict(mod, newdata=fit26data[1:282,], type="Class")
And pred is a vector of values ranging from 0 to 1:
Code:
> pred
[1] 0.0465 0.3930 0.4426 0.4905 ...and so on
I'm confused about why pred didn't just return a value of either 0 or 1. Does this have to do with the fact that my response variable, status, is a numeric vector and not a factor? If the predictions are supposed to range from 0 to 1 in this case, what's the cutoff point for deciding whether a prediction should be classified as 0 or 1? Would it simply be 0.5?
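In other words, would I just threshold the predictions by hand, something like this (with 0.5 as my guess for the cutoff)?

Code:
pred_class=ifelse(pred>=0.5, 1, 0)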

Lazar

Phineas Packard
type="Class" should be type="class" (no caps). In any case looking up ?predict.regbagg leads me to believe you want aggregation="majority" not type="class".

p.s. Why are you using bagging rather than boosting?

randomcat

New Member
Thanks for the response. I changed my response variable to a factor and the predict function then returned either 0s or 1s. I noticed, though, that bagging actually lowered the predictive power (the single tree I built using rpart predicted more accurately), and I'm not sure why. A professor suggested that I use the bagging technique. I'm not too familiar with boosting; would that be better than bagging in this case?

Lazar

Phineas Packard
randomForest with mtry below the number of features is technically a random forest rather than boosting, but the idea is similar: the individual trees are less correlated than they are in bagging. To try both, do the following:

Code:
library(randomForest)
# Bagging: mtry must equal the number of features you are using
mod_bag=randomForest(status~x1+x2+x3+x4, data=fit26data, mtry=4)
# Random forest: mtry is some value < the number of features.
# I guessed sqrt(n features), as that is generally pretty close to the optimal value.
mod_rf=randomForest(status~x1+x2+x3+x4, data=fit26data, mtry=2)
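Then you can compare them on the same cases you scored before (a rough sketch; again assuming your data frame is fit26data and status is a factor):

Code:
pred_bag=predict(mod_bag, newdata=fit26data[1:282,])
pred_rf=predict(mod_rf, newdata=fit26data[1:282,])
mean(pred_bag==fit26data$status[1:282])  # proportion correct, bagging
mean(pred_rf==fit26data$status[1:282])   # proportion correct, random forest

Printing mod_bag or mod_rf also shows the out-of-bag error estimate, which is a fairer comparison than accuracy on rows the models were trained on.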

Lazar

Phineas Packard
p.s. the package e1071 has functions like tune.randomForest, which you can feed a range of guesses for things like mtry and have cross-validation pick the best values for you. I have found the square root of the number of features is pretty good, but I can usually do better by tuning the value with cross-validation.
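Something like this (a rough sketch; the candidate range 1:4 is just a guess to cover your four features):

Code:
library(e1071)
library(randomForest)
# cross-validate over candidate mtry values (tune() uses 10-fold CV by default)
tuned=tune.randomForest(status~x1+x2+x3+x4, data=fit26data, mtry=1:4)
tuned$best.parameters  # the mtry value with the lowest cross-validated error
mod_best=tuned$best.model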