The gbm package in R has a handy feature: it parallelizes cross-validation by sending each fold to its own node. I would like to build multiple cross-validated GBM models over a range of hyperparameters. Ideally, because I have multiple cores, I could also parallelize the building of these models. With 12 cores, I could, in theory, have 4 models building simultaneously, each using 3-fold cross-validation. Something like this:
    library(gbm)
    library(doParallel)  # also attaches foreach and parallel

    tuneGrid <- expand.grid(
      n_trees = 5000,
      shrink  = c(.0001),
      i.depth = seq(10, 25, 5),
      minobs  = 100,
      distro  = c(0, 1)  # 0 = bernoulli, 1 = adaboost
    )
    cl <- makeCluster(4, outfile = "GBMlistening.txt")
    registerDoParallel(cl)  # 4 parent workers to run in parallel
    err.vect <- NA  # initialize
    system.time(
      err.vect <- foreach(j = 1:nrow(tuneGrid), .packages = c('gbm'), .combine = rbind) %dopar% {
        fit <- gbm(Label ~ ., data = training,
                   n.trees = tuneGrid[j, 'n_trees'],
                   shrinkage = tuneGrid[j, 'shrink'],
                   interaction.depth = tuneGrid[j, 'i.depth'],
                   n.minobsinnode = tuneGrid[j, 'minobs'],
                   distribution = ifelse(tuneGrid[j, 'distro'] == 0, "bernoulli", "adaboost"),
                   weights = weights$Weight,
                   bag.fraction = 0.5,
                   cv.folds = 3,
                   n.cores = 3)  # will this make 4 x 3 = 12 workers?
        # convert the cross-validated fitted values to probabilities
        cv.test <- data.frame(scores = 1 / (1 + exp(-fit$cv.fitted)),
                              Weight = training$Weight, Label = training$Label)
        print(j)  # write out to the listener
        # one result row per grid row: AUC, AMS (getAMS is my own function), then the settings used
        cbind(gbm.roc.area(cv.test$Label, cv.test$scores), getAMS(cv.test),
              tuneGrid[j, 'n_trees'], tuneGrid[j, 'shrink'], tuneGrid[j, 'i.depth'],
              tuneGrid[j, 'minobs'], tuneGrid[j, 'distro'], j)
      }
    )
    stopCluster(cl)  # clean up after ourselves
I would use the caret package, but I want to tune some hyperparameters beyond those caret exposes by default, and I would prefer not to build my own custom model in caret at this time. I am on a Windows machine, which I know affects which parallel back-end gets used.
If I do this, will each of the 4 cluster workers I start spawn 3 workers of its own, for a total of 12 worker processes running at once? Or will only 4 cores be working at any one time?
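One way to probe this without waiting on a full GBM fit might be a small nesting test that uses only the base parallel package. This is a sketch under an assumption: that gbm, when given `cv.folds` and `n.cores`, starts its own socket cluster inside whichever process calls it (not verified here). The names `n_outer`, `n_inner`, and `pids` are hypothetical, and the counts are kept small so the test is cheap; `parLapply` stands in for the `foreach` loop.

```r
library(parallel)

n_outer <- 2  # hypothetical stand-in for the 4 parent workers
n_inner <- 2  # hypothetical stand-in for n.cores = 3 inside gbm

outer_cl <- makeCluster(n_outer)  # socket cluster, the type available on Windows

# Each outer worker starts its own inner cluster and reports the process IDs involved.
pids <- parLapply(outer_cl, seq_len(n_outer), function(i, n_inner) {
  inner_cl <- parallel::makeCluster(n_inner)
  on.exit(parallel::stopCluster(inner_cl))
  inner <- unlist(parallel::clusterCall(inner_cl, Sys.getpid))
  list(outer = Sys.getpid(), inner = inner)
}, n_inner = n_inner)

stopCluster(outer_cl)

# Number of distinct inner worker processes that were launched
n_inner_procs <- length(unique(unlist(lapply(pids, `[[`, "inner"))))
print(n_inner_procs)
```

If every outer worker reports its own set of inner process IDs, nested clusters are at least possible on the machine. Whether gbm actually behaves this way inside `%dopar%` would still need to be confirmed, for example by watching the process list while the real job runs.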