Building parallel GBM models using cross-validation in R

Question

The gbm package in R has a handy feature: it parallelizes cross-validation by sending each fold to its own node. I would like to build multiple cross-validated GBM models over a range of hyperparameters. Ideally, because I have multiple cores, I could also parallelize the building of these models. With 12 cores, I could, in theory, have 4 models building simultaneously, each using 3-fold cross-validation. Something like this:

tuneGrid <- expand.grid(
        n_trees = 5000, 
        shrink = c(.0001),
        i.depth = seq(10,25,5),
        minobs = 100,
        distro = c(0,1) #0 = bernoulli, 1 = adaboost
        )
library(doParallel) #also attaches foreach and parallel
cl <- makeCluster(4, outfile="GBMlistening.txt")
registerDoParallel(cl) #4 parent cores to run in parallel
err.vect <- NA #initialize
system.time(
err.vect <- foreach (j=1:nrow(tuneGrid), .packages=c('gbm'),.combine=rbind) %dopar% {
        fit <- gbm(Label~., data=training, 
            n.trees = tuneGrid[j, 'n_trees'], 
            shrinkage = tuneGrid[j, 'shrink'],
            interaction.depth=tuneGrid[j, 'i.depth'], 
            n.minobsinnode = tuneGrid[j, 'minobs'], 
            distribution=ifelse(tuneGrid[j, 'distro']==0, "bernoulli", "adaboost"),
            w=weights$Weight,
            bag.fraction=0.5,
            cv.folds=3,
            n.cores = 3) #will this make 4X3=12 workers?
        cv.test <- data.frame(scores=1/(1 + exp(-fit$cv.fitted)), Weight=training$Weight, Label=training$Label)
        print(j) #write out to the listener
        cbind(gbm.roc.area(cv.test$Label, cv.test$scores), getAMS(cv.test), tuneGrid[j, 'n_trees'], tuneGrid[j, 'shrink'], tuneGrid[j, 'i.depth'],tuneGrid[j, 'minobs'], tuneGrid[j, 'distro'], j )
}
)
stopCluster(cl) #clean up after ourselves

I would use the caret package; however, I have some hyperparameters beyond those caret exposes by default, and I would prefer not to build my own custom model in caret at this time. I am on a Windows machine, which I mention because it affects which parallel back-end gets used.

If I do this, will each of the 4 cluster workers I start up spawn 3 workers of its own, for a total of 12 workers chugging away? Or will I only have 4 cores working at once?

You may check this in task manager on Windows, how many processes will be launched. — DrDom, Aug 21 '14 at 19:24
If I run your code, I get: task 1 failed - "object of type 'closure' is not subsettable". Any help? — bicepjai, Jul 29 '15 at 03:09
@bicepjai what part of the code results in that error? I might be able to help you debug if you give some details. However, you might want to start a new question, since it's a different topic and caused by something other than the original intent of my question. — Amw 5G, Aug 06 '15 at 11:47

score 0 · Accepted Answer · answered Aug 28 '14 at 13:04

I believe this will do what you want. The foreach loop will run four instances of gbm, and each of them will create a three-node cluster using makeCluster. So you'll actually have 16 workers (the 4 foreach workers plus 4 x 3 = 12 gbm workers), but only 12 will perform serious computation at any one time, since each foreach worker is mostly idle while its three gbm workers fit the folds. You have to be careful with nested parallelism, but I think this will work.
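The worker count can be checked without gbm at all. The following is only a minimal sketch using the parallel package: the names n_outer, n_inner and total_workers are made up for illustration, and the inner makeCluster call is a stand-in for the cluster that gbm is described as creating above, not gbm's actual code. It uses 2 x 2 instead of 4 x 3 to stay light.

```r
library(parallel)

n_outer <- 2  # hypothetical; the question uses 4 foreach workers
n_inner <- 2  # hypothetical; the question uses n.cores = 3

outer_cl <- makeCluster(n_outer)

# Each outer worker starts its own inner cluster and reports the process IDs.
pids <- parLapply(outer_cl, seq_len(n_outer), function(i, n_inner) {
  inner_cl <- parallel::makeCluster(n_inner)
  on.exit(parallel::stopCluster(inner_cl))
  inner <- unlist(parallel::clusterCall(inner_cl, Sys.getpid))
  list(outer = Sys.getpid(), inner = inner)
}, n_inner = n_inner)

stopCluster(outer_cl)

outer_pids <- vapply(pids, function(p) p$outer, integer(1))
inner_pids <- unlist(lapply(pids, function(p) p$inner))

# Outer workers plus all inner workers: n_outer + n_outer * n_inner
total_workers <- length(outer_pids) + length(inner_pids)
print(total_workers)
```

With the question's settings (4 outer, 3 inner) the same arithmetic gives 4 + 4 x 3 = 16 processes, which is also what you should be able to see in Task Manager while the models are fitting.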
