
I've been fighting with this issue for an embarrassingly long time. I feel like an absolute cretin, as the answer is probably painfully obvious, but I cannot find a single thread that explains how to do this.

The part of the documentation about custom model creation doesn't make it any clearer for me. I feel like somewhere during my education I missed some very specific class that everybody but me remembers, because all I can find is "yeah, just create a custom model, and done".

Actual questions here:

I want to get predictions for every single iteration of a gbm run in caret. In plain gbm I can just pass a vector of tree counts, e.g. predict(..., n.trees = 1:100), and it's done.
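For reference, this is roughly what I mean in plain gbm (just a sketch, assuming a data frame df with a 0/1 response y):

library(gbm)

# fit a single gbm run
fit <- gbm(y ~ ., data = df, distribution = "bernoulli",
           n.trees = 100, interaction.depth = 2, shrinkage = 0.01)

# one column of predictions per requested tree count
preds <- predict(fit, newdata = df, n.trees = 1:100, type = "response")
dim(preds) # nrow(df) x 100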

In caret, apparently, I need to use something called the "sub-model trick", which, if I understand correctly, means that I have to create my own custom model.

But I can see in getModelInfo('gbm') that there already is some kind of loop function:

$gbm$loop
function (grid) 
{
    loop <- plyr::ddply(grid, c("shrinkage", "interaction.depth", 
        "n.minobsinnode"), function(x) c(n.trees = max(x$n.trees)))
    submodels <- vector(mode = "list", length = nrow(loop))
    for (i in seq(along = loop$n.trees)) {
        index <- which(grid$interaction.depth == loop$interaction.depth[i] & 
            grid$shrinkage == loop$shrinkage[i] & grid$n.minobsinnode == 
            loop$n.minobsinnode[i])
        trees <- grid[index, "n.trees"]
        submodels[[i]] <- data.frame(n.trees = trees[trees != 
            loop$n.trees[i]])
    }
    list(loop = loop, submodels = submodels)
}

How do I use that? Why does it not work by default? Do I actually need to create a custom model, or not?

Disclaimer 1: I do not want to use any cross-validation. I just want to pull out predictions for every iteration of a single gbm run.

Disclaimer 2: I don't want to use predict.gbm() on $finalModel, as I also want to test some other algorithms that make use of the sub-model trick. I do not want to call each algorithm's own predict() function, because then why would I even bother with caret.

I do not even know what I should put in as a reproducible example. There is no problem with the code; I just have no idea how this thing is supposed to work.

M. Ike
  • So you would like predictions on the training data for each tree? What would be the point of this? I might be able to help if you would like to pull cross-validated/bootstrap predictions for each tree without creating a custom model. As far as I know, caret does not make it easy to pull train predictions for any model, since they mean very little. – missuse Aug 27 '18 at 09:26
  • @missuse Pretty much that, but I wish to get predictions for each tree for both training and test data, to later create learning curves for presentation purposes. I want to have full control over the parameters, and I want to reduce all black-box-type elements as much as possible, so as of now I'm not interested in cross-validation options. I also do not understand why caret makes that so hard to achieve. That functionality seems rather important to me for comparing different algorithms' performance over time in presentations. – M. Ike Aug 27 '18 at 09:40
  • Algorithm performance cannot be estimated on the train data; this is most likely the reason why caret's author chose not to provide easy access to train data predictions. – missuse Aug 27 '18 at 09:46
  • @missuse I specified I wish to also (and mainly) pull predictions for test data. – M. Ike Aug 27 '18 at 09:48

1 Answer


Here is an example of how to pull the desired predictions on the held-out (test) data for each tree:

library(caret)
library(mlbench) # for the data set
data(Sonar)      # a data set I always use on Stack Overflow

res <- train(Class ~ .,
             data = Sonar,
             method = "gbm",
             trControl = trainControl(method = "cv", # some evaluation scheme
                                      number = 5,
                                      savePredictions = "all"), # tell caret to save all hold-out predictions
             tuneGrid = expand.grid(shrinkage = 0.01,
                                    interaction.depth = 2,
                                    n.minobsinnode = 10,
                                    n.trees = 1:100)) # some arbitrary values and all the trees

res$pred # the predictions are stored in here
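For example, to get the hold-out accuracy per tree count from those saved predictions (a sketch; I assume dplyr is available, and pred, obs and n.trees are the column names as caret stores them):

library(dplyr)

res$pred %>%
  group_by(n.trees) %>%
  summarise(Accuracy = mean(pred == obs))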

Basically, the code you are showing in the post tells caret not to fit all n.trees models, but rather to fit just the one with max(n.trees) for each hyperparameter combination, and then to use that fitted model to obtain predictions for every n.trees < max(n.trees).
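You can watch that collapsing happen by calling the loop function on a grid yourself (a sketch; getModelInfo returns a list, hence the [[1]]):

info <- getModelInfo("gbm", regex = FALSE)[[1]]

grid <- expand.grid(shrinkage = 0.01,
                    interaction.depth = 2,
                    n.minobsinnode = 10,
                    n.trees = 1:100)

info$loop(grid)
# $loop has a single row with n.trees = 100 (the only model actually fit);
# $submodels[[1]] holds n.trees 1:99 (predicted from that fitted model)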

Here is a quick plot:

library(ggplot2)

ggplot(res$results) +
  geom_line(aes(x = n.trees, y = Accuracy))

[plot: cross-validated Accuracy as a function of n.trees]

You can also opt not to use savePredictions = "all", since it makes for a memory-hungry train object, and instead rely on res$results, which already contains the aggregated resampling metrics.
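For instance, a quick check of what res$results holds (assuming the Accuracy/Kappa columns of a classification run):

head(res$results[, c("n.trees", "Accuracy", "Kappa")])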

missuse
  • This looks a lot like what I wish to obtain. But is it possible to do the same with no cross-validation (trainControl(method = "none"))? Sorry, I'm on mobile, editing is rather limited here. When I tried that, caret told me that I cannot specify a range of values for a parameter. – M. Ike Aug 27 '18 at 10:01
  • Unfortunately no, since with `method = "none"` no predictions are provided and you can specify just one hyperparameter combination. – missuse Aug 27 '18 at 10:06
  • That is really weird that this is not allowed. It seems like my only option is to create a custom model, I guess? Thank you for all your help. – M. Ike Aug 27 '18 at 10:08
  • You can extract both train and test predictions with the `mlr` library. Check my answer here: https://stackoverflow.com/questions/48754886/caret-obtain-train-cv-predictions-from-model-to-plot – missuse Aug 27 '18 at 10:10