
I have a problem when searching for optimal xgboost hyperparameters with the mlr package in R, using random search, on Ubuntu 18.04. This is the setup code for the search:

library(mlr)
library(parallelMap)
library(parallel)  # for detectCores()

eta_value <- 0.05
set.seed(12345)

# 2. Create tasks
train.both$y <- as.factor(train.both$y) # altering y in train.both!
traintask <- makeClassifTask(data = train.both,target = "y")

# 3. Create learner
lrn <- makeLearner("classif.xgboost",predict.type = "prob")
lrn$par.vals <- list(
  objective="binary:logistic",
  booster = "gbtree",
  eval_metric="auc",
  early_stopping_rounds=10,
  nrounds=xgbcv$best_iteration,
  eta=eta_value,
  weight = train_data$weights
)

# 4. Set parameter space
params <- makeParamSet(
  makeDiscreteParam("max_depth", values = c(4,6,8,10)),
  makeNumericParam("min_child_weight",lower = 1L,upper = 10L),
  makeDiscreteParam("subsample", values = c(0.5, 0.75, 1)),
  makeDiscreteParam("colsample_bytree", values = c(0.4, 0.6, 0.8, 1)),
  makeNumericParam("gamma",lower = 0L,upper = 7L)
)

# 5. Set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=10L)

# 6. Search strategy
ctrl <- makeTuneControlRandom(maxit = 60L, tune.threshold = F)

# Set parallel backend and tune parameters
parallelStartMulticore(cpus = detectCores())

# 7. Parameter tuning
timer <- proc.time()
mytune <- tuneParams(learner = lrn,
                     task = traintask,
                     resampling = rdesc,
                     measures = auc,
                     par.set = params,
                     control = ctrl,
                     show.info = T)
proc.time() - timer
parallelStop()  # note: parentheses are required, otherwise the function is never called

As you can see, I distribute the search across all my CPU cores. The problem is that it has been running for over 5 days - this is the mlr output for the task (displayed while it is running):

[Tune] Started tuning learner classif.xgboost for parameter set:
                     Type len Def        Constr Req Tunable Trafo
max_depth        discrete   -   -      4,6,8,10   -    TRUE     -
min_child_weight  numeric   -   -       1 to 10   -    TRUE     -
subsample        discrete   -   -    0.5,0.75,1   -    TRUE     -
colsample_bytree discrete   -   - 0.4,0.6,0.8,1   -    TRUE     -
gamma             numeric   -   -        0 to 7   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Mapping in parallel: mode = multicore; level = mlr.tuneParams; cpus = 16; elements = 60.

I used to run this on my MacBook Pro and it finished in approximately 8 hours. That laptop was a 15-inch 2018 model with a 2.6 GHz Intel Core i7 (6 cores) and 32 GB of DDR4 memory. Now I run it on a much stronger machine - the only other change is the operating system, which is now Ubuntu. The machine I'm having this problem on is a desktop with an Intel i9-9900K @ 3.60 GHz (8 cores / 16 threads), GNOME 3.28.2, a 64-bit OS, and 64 GB of RAM.

I have attached a screenshot taken while the mlr search task was running - it shows that not all CPU cores are engaged, the opposite of what I saw on the MacBook Pro.

What is the problem here? Is it something to do with Ubuntu and how it handles parallelization? I found a somewhat similar question here, but there was no apparent solution there either.


When I try to run this from the terminal instead of from RStudio, the cores still do not appear to be engaged (screenshots: running the script from the terminal, and the terminal output).

Corel
1 Answer

According to your screenshot, nothing is running at all. Based on your setup, all cores should be at 100%.

Your issue has nothing to do with your operating system per se. In fact, Linux is most often the best choice when it comes to parallelization.

There are occasional problems when combining the "multicore" mode with xgboost; see for example https://github.com/berndbischl/parallelMap/issues/72.

You can simply try again. If that does not work, try switching the parallelization mode to "socket".
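Switching the backend to "socket" mode might look like the sketch below. It reuses the `lrn`, `traintask`, `rdesc`, `params`, and `ctrl` objects from the question; the `nthread = 1` line is an additional assumption on my part, intended to keep each xgboost fit single-threaded so the socket workers do not oversubscribe the CPU:

```r
library(parallelMap)
library(parallel)

# socket workers are independent R processes, which sidesteps the known
# conflicts between forked ("multicore") workers and xgboost's OpenMP threads
parallelStartSocket(cpus = detectCores(), level = "mlr.tuneParams")

# optional (assumption): pin each xgboost fit to one thread so the
# workers don't compete with xgboost's own internal parallelism
lrn$par.vals$nthread <- 1

mytune <- tuneParams(learner = lrn,
                     task = traintask,
                     resampling = rdesc,
                     measures = auc,
                     par.set = params,
                     control = ctrl,
                     show.info = TRUE)
parallelStop()
```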

It is hard to detect the real root of your problem since there are multiple players involved (ports, conflicts with OpenMP, etc.).
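One way to narrow it down is to confirm that parallelMap can spawn working processes at all, independent of mlr and xgboost. A minimal check might look like this sketch:

```r
library(parallelMap)

# spawn 4 socket workers and ask each one for its own process id;
# several distinct ids confirm that real worker processes are running
parallelStartSocket(cpus = 4)
pids <- parallelMap(function(i) Sys.getpid(), 1:4)
print(unique(unlist(pids)))
parallelStop()
```

If this hangs or returns a single id, the problem lies in the parallel backend itself rather than in mlr or xgboost.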

pat-s
  • I have edited the question and added screenshots from running my script from the terminal - the cores still do not seem to be engaged. I thought running it from the terminal would solve things after reading the GitHub link you posted, but from the issues there it seems this is not fixed. From what you mentioned there, using `socket` mode will leave zombie processes. – Corel Oct 22 '19 at 11:50
  • Are you able to execute any parallelization in R? Have you tried a different algorithm than xgboost? Have you tried "socket" mode? All of these will help you track down the issue. – pat-s Oct 22 '19 at 15:50
  • I tried "socket" mode and indeed all the CPUs are at 100% now. But when I look at the logs (I set logging to TRUE), I see that it took 21 hours for the cores to finish just the first 15 iterations (one iteration per core) - now it will run another 21 hours for the next 15. This is much, much slower than on my MacBook Pro, where all 60 iterations took 8-10 hours. Why is it this slow? This machine is stronger than the MacBook Pro! – Corel Oct 24 '19 at 10:14
  • I cannot tell what's going on on your machine and why it is not as fast as expected. This is beyond the scope of this question and can also not be answered here. – pat-s Oct 24 '19 at 14:11
  • What I meant to ask is: is "socket" mode known to be slower than "multicore" mode? – Corel Oct 24 '19 at 14:14