Boosting classification tree in R

Question

I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.

Here is my code:

  #Load packages, set random seed
  library(gbm)
  set.seed(1)

  #Generate random data
  N<-1000
  x<-rnorm(N)
  y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
  z<-rep(0,N)
  for(i in 1:N){
    if(x[i]-y[i]+0.2*rnorm(1)>1.0){
      z[i]=1
    }
  }

  #Create data frame
  myData<-data.frame(x,y,z)

  #Split data set into train and test
  train<-sample(N,800,replace=FALSE)
  test<-(-train)

  #Boosting
  boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
  pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
  pred.boost

pred.boost is a vector with elements from the interval (0,1).

I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".

How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?

abhiieor · Accepted Answer · 2017-03-04T14:57:16.290

The behavior you observe is correct. From the documentation:

If type="response" then gbm converts back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli.

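So with type="response" you get predicted probabilities P(z = 1), which is the expected behavior. Also, distribution="bernoulli" only tells gbm that the labels follow a Bernoulli (0/1) pattern; it does not make predict return class labels. You can omit the argument and the model will still run, because gbm then guesses the distribution from the response and assumes bernoulli when the response has exactly two unique values.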
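To make the "scale of the outcome" point concrete: for a bernoulli model, predict's default type="link" returns values on the log-odds scale, and type="response" is the logistic transform of those values. Here is a minimal base-R sketch of that relationship. The log.odds vector is made up for illustration and only stands in for real predict output.

```r
# Hypothetical log-odds values, standing in for what
# predict(boost.myData, ..., type = "link") would return.
log.odds <- c(-2.5, -0.4, 0, 1.3, 4.0)

# Logistic transform: maps log-odds to probabilities in (0, 1).
# plogis(x) is the same as 1 / (1 + exp(-x)).
prob <- plogis(log.odds)
print(prob)

# qlogis() is the inverse, mapping probabilities back to log-odds.
back <- qlogis(prob)
print(back)
```

Since a log-odds of 0 maps to a probability of 0.5, a cutoff of 0.5 on the response scale is equivalent to a cutoff of 0 on the link scale.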

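To proceed, threshold the probabilities, e.g. predict_class <- pred.boost > 0.5 (cutoff = 0.5), or plot a ROC curve and choose the cutoff yourself. The comparison returns TRUE/FALSE; wrap it in as.integer() if you need 0/1 labels.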
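Both options can be sketched in base R. The vectors pred.prob and truth below are small made-up stand-ins for pred.boost and myData$z[test], so the numbers are illustrative only. Picking the cutoff by Youden's J (TPR minus FPR) is just one common rule, used here as an assumption, not something prescribed by gbm.

```r
# Hypothetical stand-ins for pred.boost and myData$z[test].
pred.prob <- c(0.05, 0.10, 0.35, 0.40, 0.62, 0.70, 0.85, 0.93)
truth     <- c(0,    0,    1,    0,    0,    1,    1,    1)

# Option 1: fixed cutoff of 0.5, converted to 0/1 labels.
cutoff <- 0.5
pred.class <- as.integer(pred.prob > cutoff)
conf.mat <- table(predicted = pred.class, actual = truth)
test.error <- mean(pred.class != truth)
print(conf.mat)
print(test.error)

# Option 2: scan candidate cutoffs and compute the ROC points.
# Here an observation is classified as 1 when its probability is >= the cutoff.
cutoffs <- sort(unique(pred.prob))
roc <- as.data.frame(t(sapply(cutoffs, function(k) {
  cls <- as.integer(pred.prob >= k)
  c(cutoff = k,
    tpr = sum(cls == 1 & truth == 1) / sum(truth == 1),  # true positive rate
    fpr = sum(cls == 1 & truth == 0) / sum(truth == 0))  # false positive rate
})))

# Youden's J statistic: one possible rule for choosing a cutoff.
roc$youden <- roc$tpr - roc$fpr
best.cutoff <- roc$cutoff[which.max(roc$youden)]
print(roc)
print(best.cutoff)
```

With real predictions, plot(roc$fpr, roc$tpr, type = "s") draws the ROC curve. The right cutoff depends on the relative cost of false positives and false negatives, so treat the Youden choice as a starting point.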

edited Mar 04 '17 at 14:57

answered Mar 03 '17 at 09:44

Thank you for your answer! How about multinomial Y instead of binary Y? In that case, how shall I code to return the label with the highest probability? – yihan Apr 02 '18 at 01:31

score 0 · Answer 2 · answered Sep 14 '17 at 20:46

Try the adabag package. Class, probabilities, votes, and error are built into adabag, which makes the output easy to interpret and, of course, takes fewer lines of code.

S_Dhungel
