
The bagging wrapper seems to give strange results. If I apply it to a simple logistic regression, the logloss is amplified by a factor of 10:

library(mlbench)
library(mlr)

data(PimaIndiansDiabetes)

# classification task on the Pima Indians data, predicting diabetes ("pos" = positive class)
trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes, target = "diabetes", positive = "pos")

# bag 10 logistic regressions, each fit on an 80% bootstrap sample using all features
bagged.lrn = makeBaggingWrapper(makeLearner("classif.logreg"), bw.iters = 10, bw.replace = TRUE, bw.size = 0.8, bw.feats = 1)
bagged.lrn = setPredictType(bagged.lrn, "prob")
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"), "prob")

rdesc = makeResampleDesc("CV", iters = 5L)

resample(learner = non.bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE, measures = logloss)
resample(learner = bagged.lrn, task = trainTask1, resampling = rdesc, show.info = FALSE, measures = logloss)

gives

Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg
logloss.aggr: 0.49
logloss.mean: 0.49
logloss.sd: 0.02
Runtime: 0.0699999

for the first learner and

Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 5.41
logloss.mean: 5.41
logloss.sd: 0.80
Runtime: 0.645

for the bagged one. Thus the performance of the bagged one is much worse. Is there a bug or did I do something wrong?

This is my sessionInfo():

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlr_2.9          stringi_1.1.1    ParamHelpers_1.8 ggplot2_2.1.0    BBmisc_1.10      mlbench_2.1-1   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6      magrittr_1.5     splines_3.3.1    munsell_0.4.3    lattice_0.20-33  xtable_1.8-2     colorspace_1.2-6
 [8] R6_2.1.2         plyr_1.8.4       dplyr_0.5.0      tools_3.3.1      parallel_3.3.1   grid_3.3.1       checkmate_1.8.1 
[15] data.table_1.9.6 gtable_0.2.0     DBI_0.4-1        htmltools_0.3.5  ggvis_0.4.3      survival_2.39-4  assertthat_0.1  
[22] digest_0.6.9     tibble_1.1       Matrix_1.2-6     shiny_0.13.2     mime_0.5         parallelMap_1.3  scales_0.4.0    
[29] backports_1.0.3  httpuv_1.3.3     chron_2.3-47    
Richi W
  • Did you use any particular seed? It's not a big deal, but without setting the seed there will be some random variation in the results you got versus what we'll get when we run it. – Hack-R Sep 23 '16 at 13:44
  • Right, I should have used a seed. Thanks for this comment. Still, the error is 10 times as large as without bagging, so I assume there is a bug. Just wanted to ask here first. – Richi W Sep 23 '16 at 13:57
  • No problem. It's not a bug, see my answer below. Cheers. – Hack-R Sep 23 '16 at 13:59

1 Answer


There's not necessarily anything wrong with this result, though the bagging model could be better specified.

Bagging won't always give you better performance statistics; rather, it helps you avoid overfitting and can improve accuracy.

Thus the reason your non-bagged model has better performance statistics may simply be that it's overfitting, or otherwise producing a more biased result whose performance statistics are misleading.
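For example, you can check whether bagging hurts classification quality itself or only the probability estimates by scoring both learners on several measures at once. A quick sketch, reusing the objects from the question; `mmce` and `auc` are mlr's built-in misclassification-rate and AUC measures:

# score both learners on logloss plus plain classification measures;
# logloss punishes overconfident probabilities, mmce/auc do not
res.plain  = resample(non.bagged.lrn, trainTask1, rdesc,
                      measures = list(logloss, mmce, auc), show.info = FALSE)
res.bagged = resample(bagged.lrn, trainTask1, rdesc,
                      measures = list(logloss, mmce, auc), show.info = FALSE)
res.plain$aggr   # aggregated performance of the plain learner
res.bagged$aggr  # aggregated performance of the bagged learner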

However, here's a much improved specification of the bagging model that gets the average logloss down by 70%:

pacman::p_load(mlbench, mlr)

data(PimaIndiansDiabetes)
set.seed(1)  # fix the RNG so the resampling results are reproducible

trainTask1 <- makeClassifTask(data = PimaIndiansDiabetes, target = "diabetes", positive = "pos")

# more, smaller bags plus feature subsampling: 100 iterations,
# each on a 60% bootstrap sample using half of the features
bagged.lrn     = makeBaggingWrapper(makeLearner("classif.logreg"),
                                    bw.iters = 100,
                                    bw.replace = TRUE,
                                    bw.size = 0.6,
                                    bw.feats = 0.5)
bagged.lrn     = setPredictType(bagged.lrn, "prob")
non.bagged.lrn = setPredictType(makeLearner("classif.logreg"), "prob")

rdesc = makeResampleDesc("CV", iters = 10L)

resample(learner    = non.bagged.lrn,
         task       = trainTask1,
         resampling = rdesc,
         show.info  = TRUE,
         measures   = logloss)

resample(learner    = bagged.lrn,
         task       = trainTask1,
         resampling = rdesc,
         show.info  = TRUE,
         measures   = logloss)

where the key result is

Resample Result
Task: PimaIndiansDiabetes
Learner: classif.logreg.bagged
logloss.aggr: 1.65
logloss.mean: 1.65
logloss.sd: 0.90
Runtime: 14.0544
Hack-R
  • Thanks for this answer. I just wonder... bw.iters = 100 is quite a lot, I would say (for practical applications with large data sets). Maybe the example set is too small. The data set to which I apply it is much bigger, and there I think bagging with more than 10 iterations is nearly infeasible. – Richi W Sep 23 '16 at 14:14
  • But yes, maybe my application is a poor fit, as I would rather use bagging in combination with low bias but high variance; in the logistic regression case I might have larger bias and smaller variance, so bagging could hurt in terms of bias. Thanks. – Richi W Sep 23 '16 at 14:17
  • @Richard With 10 iterations you can still get it down by at least 50%, but I typically use this many with Big Data; I just offset the size of the data with parallelization. So if you have 20 columns and 200,000 rows you might want to run it on a server with 50 cores or so, for example (and use either a parallelization package for your OS or take advantage of built-in multicore options if available -- I don't know about this package, but `caret` has them). There are also other Big Data strategies you can take advantage of, like chunking, sampling, and feature engineering. – Hack-R Sep 23 '16 at 14:38
  • Yes, mlr has parallelization -- see [the tutorial](https://mlr-org.github.io/mlr-tutorial/devel/html/parallelization/index.html). For things like `bw.iters` you can use mlr's built-in parameter tuning (see [the tutorial](https://mlr-org.github.io/mlr-tutorial/devel/html/tune/index.html)) to determine the best value for your data automatically. – Lars Kotthoff Sep 23 '16 at 16:06
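
Following up on those two comments, here is a rough sketch of what parallelizing the bagging and tuning `bw.iters` could look like, assuming the task and bagged learner from the answer above; `parallelStartSocket`/`parallelStop` come from parallelMap, and `makeParamSet`/`tuneParams` are mlr's standard tuning API:

library(parallelMap)

# run the bagging iterations on 2 local worker processes
# (raise the count to match your machine's cores)
parallelStartSocket(2)

# grid-search a few values of bw.iters, scored by cross-validated logloss
ps = makeParamSet(makeDiscreteParam("bw.iters", values = c(10, 25, 50, 100)))
tuned = tuneParams(bagged.lrn, task = trainTask1, resampling = rdesc,
                   par.set = ps, control = makeTuneControlGrid(),
                   measures = logloss)

parallelStop()
tuned$x  # the bw.iters value that gave the lowest cross-validated logloss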