
I don't really understand the practical difference between distribution = 'adaboost' and distribution = 'bernoulli' in gbm.

   library(MASS)
   library(gbm)
   data=Boston
   data$chas = factor(data$chas)
   ada_model = gbm(chas~ . , data, distribution ='adaboost')
   bern_model = gbm(chas ~ . , data, distribution = 'bernoulli')
   ada_model
   bern_model

I don't understand why bernoulli doesn't give any results. I guess I have a fundamental misunderstanding of how this works?

I'm looking for: 1. An explanation of why bernoulli doesn't work; I thought the documentation said it can be used for classification. 2. If both can be used for classification, what are the practical differences?

runningbirds

2 Answers


Bernoulli is breaking for you because gbm expects a 0/1 response for that distribution, and the factor call recodes the 0/1s to integer codes 1/2:

> str(factor(data$chas[350:400]))
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
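
A minimal sketch of the recoding, using a small made-up vector rather than the Boston data (the names `y`, `f` and `y_back` are only for illustration). It shows that the integer codes behind a factor start at 1, and one way to get the original 0/1 values back:

```r
# Hypothetical 0/1 response, standing in for data$chas
y <- c(0L, 0L, 1L, 0L, 1L)

f <- factor(y)
levels(f)        # the labels are still "0" and "1"
as.integer(f)    # but the underlying integer codes are 1 and 2

# Go through character to recover the original 0/1 values
y_back <- as.integer(as.character(f))
identical(y_back, y)
```

Calling `as.integer` directly on the factor returns the codes, not the labels, which is why the detour through `as.character` is needed.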
Neal Fultz
> str(data$chas)
 int [1:506] 0 0 0 0 0 0 0 0 0 0 ...
> sum(data$chas==0) + sum(data$chas==1)
[1] 506

chas already consists of 506 integers, each either zero or one, so there is nothing to convert. Remove line 4 (the data$chas = factor(data$chas) call), as @Neal Fultz recommended in his original comment and explained in his answer. If you want to explicitly bound the variable to {0,1}, you can use as.logical, and your code becomes:

library(MASS)
library(gbm)
data=Boston
data$chas = as.logical(data$chas) # optionally cast as logical to force range into 0 or 1
ada_model = gbm(chas~ . , data, distribution ='adaboost')
bern_model = gbm(chas ~ . , data, distribution = 'bernoulli')
ada_model
bern_model

Reading between the lines a little, I'm guessing that your real problem is that your production dataset has values other than {0,1}. Casting to logical converts every nonzero value to TRUE (1), and you're ready to go. If that's not what you want, then use this to find those values and examine them case by case:

which((data$chas != 0) & (data$chas != 1))
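
As a rough illustration of both options on a made-up vector (`x` and `bad` are hypothetical names, not from the Boston data): `as.logical` maps every nonzero number to TRUE, while the `which` call returns the positions of the values that need a closer look. Note that `which` skips NA, so missing values are worth checking separately.

```r
# Hypothetical response with some values outside {0, 1}
x <- c(0, 1, 2, 0, -1, NA, 1)

# Any nonzero number becomes TRUE; NA stays NA
as.logical(x)

# Positions of values that are neither 0 nor 1 (which() drops NA)
bad <- which((x != 0) & (x != 1))
bad
x[bad]

# Missing values need their own check
which(is.na(x))
```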
woodvi