Bernoulli vs Adaboost GBM?

Question

I don't really understand the practical difference between distribution = 'adaboost' and distribution = 'bernoulli' in gbm.

   library(MASS)
   library(gbm)
   data=Boston
   data$chas = factor(data$chas)
   ada_model = gbm(chas~ . , data, distribution ='adaboost')
   bern_model = gbm(chas ~ . , data, distribution = 'bernoulli')
   ada_model
   bern_model

I don't understand why bernoulli doesn't give any results. I guess I have a fundamental misunderstanding of how this works?

I'm looking for:

1. An explanation of why bernoulli doesn't work. I thought the documentation said it can be used for classification.
2. If both can be used for classification, what are the practical differences?

Your code works fine for me if I comment out line 4. – Neal Fultz, Nov 17 '15 at 23:48
Yes, but then it is no longer classification? – runningbirds, Nov 18 '15 at 18:40

score 0 · Answer 1 · answered Nov 18 '15 at 00:10

distribution = 'bernoulli' is breaking for you because the factor() call recodes the 0/1 values to the internal integer codes 1/2, while bernoulli expects a response in {0,1}:

> str(factor(data$chas[350:400]))
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
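
If a column has already been turned into a factor, the 0/1 values can be recovered by converting through the level labels rather than the internal codes. A minimal base-R sketch, assuming a small made-up vector `chas` as a stand-in for data$chas (the variable names here are illustrative, not from the original code):

```r
# Hypothetical stand-in for data$chas; the values are made up for illustration
chas <- c(0L, 0L, 1L, 0L, 1L)
f <- factor(chas)

# as.integer() on a factor returns the internal level codes, not the labels
codes <- as.integer(f)                  # 1 1 2 1 2

# Going through as.character() first recovers the original 0/1 values
values <- as.integer(as.character(f))   # 0 0 1 0 1

print(codes)
print(values)
```

In this thread's case the simpler route is to skip the factor() call entirely, since the column is already 0/1.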

Neal Fultz

How do I fix this then? I can't seem to change them to 0s and 1s? – runningbirds Nov 18 '15 at 18:40

woodvi · Answer 2 · 2016-05-24T22:04:50.583

> str(data$chas)
 int [1:506] 0 0 0 0 0 0 0 0 0 0 ...
> sum(data$chas==0) + sum(data$chas==1)
[1] 506

In the unmodified Boston data, chas is already 506 integers, each either zero or one, so there is nothing to convert. Remove line 4 (the factor() call) as @Neal Fultz recommended in his original comment and explained in his answer. If you want to explicitly restrict the variable to {0,1}, you can use as.logical, and your code becomes:

library(MASS)
library(gbm)
data=Boston
data$chas = as.logical(data$chas) # optionally cast as logical to force range into 0 or 1
ada_model = gbm(chas~ . , data, distribution ='adaboost')
bern_model = gbm(chas ~ . , data, distribution = 'bernoulli')
ada_model
bern_model

Reading between the lines a little, I'm guessing that your real problem is that your production dataset has values other than {0,1}. Casting to logical converts every nonzero value to TRUE (1) and leaves zeros as FALSE (0), after which you're ready to go. If that's not what you want, then use this to find the offending rows and examine them case by case:

which((data$chas != 0) & (data$chas != 1))
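
One caveat with that check: comparing against NA yields NA, and which() drops NA, so missing values are silently skipped. A slightly more defensive sketch uses %in%, which treats NA as "not in the set". The vector `x` below is hypothetical, standing in for a production column:

```r
# Hypothetical column with two out-of-range values and one missing value
x <- c(0, 1, 2, NA, 1, -1)

# The comparison-based check skips the NA at position 4
bad_original <- which((x != 0) & (x != 1))   # 3 6

# %in% returns FALSE for NA, so negating it flags the NA as well
bad_with_na <- which(!(x %in% c(0, 1)))      # 3 4 6

print(bad_original)
print(bad_with_na)
```

This matters if you then cast with as.logical, because as.logical(NA) stays NA rather than becoming TRUE or FALSE.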
