I am trying to get caret to train xgboost models over a grid of hyperparameters using a parallel backend.

Here is some code that uses the Give Me Some Credit data to demonstrate setting up a parallel backend for caret's hyperparameter grid search.

library(plyr)
library(dplyr)
library(pROC)
library(caret)
library(xgboost)
library(readr)
library(parallel)
library(doParallel)

if(exists("xgboost_cluster")) stopCluster(xgboost_cluster)
hosts = paste0("192.168.18.", 52:53)
xgboost_cluster = makePSOCKcluster(hosts, master="192.168.18.51")

# load the packages across the cluster
clusterEvalQ(xgboost_cluster, {
  deps = c("plyr", "Rcpp", "dplyr", "caret", "xgboost", "pROC", "foreach", "doParallel")
  for(d in deps) library(d, character.only = TRUE)
  rm(d, deps)
})

registerDoParallel(xgboost_cluster)  
# load in the training data
df_train = read_csv("04-GiveMeSomeCredit/Data/cs-training.csv") %>%
  na.omit() %>%                                                                # listwise deletion 
  select(-`[EMPTY]`) %>%
  mutate(SeriousDlqin2yrs = factor(SeriousDlqin2yrs,                           # factor variable for classification
                                   labels = c("Failure", "Success")))
# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.001, 0.0001),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = 1
)

# pack the training control parameters
xgb_trcontrol_1 = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",                                                        # save losses across all models
  classProbs = TRUE,                                                           # set to TRUE for AUC to be computed
  summaryFunction = twoClassSummary,
  allowParallel = TRUE
)

# train the model for each parameter combination in the grid, 
#   using CV to evaluate
xgb_train_1 = train(
  x = as.matrix(df_train %>%
                  select(-SeriousDlqin2yrs)),
  y = as.factor(df_train$SeriousDlqin2yrs),
  trControl = xgb_trcontrol_1,
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)

I checked that all cores on the worker hosts are being utilized for training, but no processes are running on the master node. Is this expected behaviour? Is there any way to change this behaviour and leverage the cores on the master node for processing as well?

1 Answer

In order to utilize the master node for processing, you just need to add 'localhost' to hosts, like so:

hosts = c("localhost", paste0("192.168.18.", 52:53))

This will add one worker process on your master node to the cluster, which will then be used for training. If you want to use multiple cores on the master, just pass in more instances of 'localhost':

hosts = c(rep('localhost', detectCores()), paste0("192.168.18.", 52:53))
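
Putting it together, a minimal sketch of the revised cluster setup (IP addresses as in the question; note that detectCores() counts logical cores, so you may prefer detectCores() - 1 to leave a core free for the master R session):

library(parallel)
library(doParallel)

# local worker processes plus the two remote hosts
hosts = c(rep("localhost", max(1, detectCores() - 1)), paste0("192.168.18.", 52:53))
xgboost_cluster = makePSOCKcluster(hosts, master = "192.168.18.51")
registerDoParallel(xgboost_cluster)   # caret's train() will now dispatch work to local and remote workers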
  • At the cost of sounding obnoxious, I think I already knew this, but was trying to add `192.168.18.51` to the list of hosts, and that was not working because the master itself was not added to the list of trusted hosts! Specifying `localhost` is the right way to do this, thanks much. :-) – tchakravarty Nov 16 '15 at 17:52