
I am trying to parallelize, at the hyperparameter-tuning level, an xgboost model that I am tuning in mlr, using parallelMap. I have code that works successfully on my Windows machine (with only 8 cores) and would like to make use of a Linux server (with 72 cores). I have not been able to gain any computational advantage by moving to the server, and I think this is a result of holes in my understanding of the parallelMap parameters.

I do not understand the differences between multicore, local, and socket as "modes" in parallelMap. Based on my reading, I think that multicore would work for my situation, but I am not sure. I used socket successfully on my Windows machine and have tried both socket and multicore on my Linux server, with unsuccessful results. On Windows I used:

parallelStart(mode="socket", cpu=8, level="mlr.tuneParams")

It is my understanding, though, that socket might be unnecessary, or perhaps slow, for parallelizing over many cores that do not need to communicate with each other, as is the case when parallelizing hyperparameter tuning.
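
For reference, here is my current (possibly wrong) understanding of the three modes, written as a minimal sketch of the corresponding parallelStart*() helpers; the cpus values are placeholders, not my real settings:

library(parallelMap)

# "local": no real parallelization; everything runs sequentially in the
# current R session, which is mainly useful for debugging
parallelStartLocal()
parallelStop()

# "multicore": forks the current R process (UNIX only); workers share the
# parent's memory copy-on-write, so nothing has to be exported to them
parallelStartMulticore(cpus = 8, level = "mlr.tuneParams")
parallelStop()

# "socket": starts fresh R sessions and ships objects to them over sockets;
# works on Windows too, but pays serialization and startup overhead
parallelStartSocket(cpus = 8, level = "mlr.tuneParams")
parallelStop()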

To elaborate on my unsuccessful results on the Linux server: I am not getting errors, but things that take less than 24 hours in serial are taking more than 2 weeks in parallel. Looking at the processes, I can see that I am indeed using several cores.

Each individual xgboost call runs in a matter of minutes, and I am not trying to speed that up. I am only trying to spread the hyperparameter tuning over several cores.

I was concerned that my very slow results on the Linux server were due to xgboost trying to make use of all the available cores during model building, so I set nthread = 1 for xgboost via mlr to ensure that does not happen. Nonetheless, my code seems to run much slower on the larger Linux server than it does on my smaller Windows computer -- any thoughts as to what might be happening?
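
(In case it is relevant: my understanding is that, besides nthread = 1 in the learner, OpenMP/BLAS threading can also be capped through environment variables before the workers start. This is just a guess on my part; the sketch below uses the standard OpenMP and OpenBLAS variable names, nothing parallelMap-specific.)

# cap threading libraries before parallelStart(), in case xgboost or the
# BLAS that R is linked against spawns its own threads on each worker
Sys.setenv(OMP_NUM_THREADS = 1)       # OpenMP threads (used by xgboost)
Sys.setenv(OPENBLAS_NUM_THREADS = 1)  # only matters if R uses OpenBLAS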

Thanks so very much.

# Learner with xgboost's own threading disabled (nthread = 1) so that the
# only parallelization happens at the mlr tuning level
xgb_learner_tune <- makeLearner(
  "classif.xgboost",
  predict.type = "response",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "map",
    nthread = 1))

library(parallelMap)
parallelStart(mode="multicore", cpu=8, level="mlr.tuneParams")

tuned_params_trim <- tuneParams(
  learner = xgb_learner_tune,
  task = trainTask,
  resampling = resample_desc,
  par.set = xgb_params,
  control = control,
  measures = list(ppv, tpr, tnr, mmce)
)
parallelStop()

Edit

I am still surprised by the lack of performance improvement when attempting to parallelize at the tuning level. Are my expectations unfair? I am getting substantially slower performance with parallelMap than with serial tuning for the process below:

numeric_ps = makeParamSet(
  makeNumericParam("C", lower = 0.5, upper = 2.0),
  makeNumericParam("sigma", lower = 0.5, upper = 2.0)
)
ctrl = makeTuneControlRandom(maxit=1024L)
rdesc = makeResampleDesc("CV", iters = 3L)

#In serial
start.time.serial <- Sys.time()
res.serial = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
                 par.set = numeric_ps, control = ctrl)
stop.time.serial <- Sys.time()
stop.time.serial - start.time.serial

#In parallel with 2 CPUs
start.time.parallel.2 <- Sys.time()
parallelStart(mode="multicore", cpu=2, level="mlr.tuneParams")
res.parallel.2 = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
                 par.set = numeric_ps, control = ctrl)
parallelStop()
stop.time.parallel.2 <- Sys.time()
stop.time.parallel.2 - start.time.parallel.2

#In parallel with 16 CPUs
start.time.parallel.16 <- Sys.time()
parallelStart(mode="multicore", cpu=16, level="mlr.tuneParams")
res.parallel.16 = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
                          par.set = numeric_ps, control = ctrl)
parallelStop()
stop.time.parallel.16 <- Sys.time()
stop.time.parallel.16 - start.time.parallel.16 

My console output is (tuning details omitted):

> stop.time.serial - start.time.serial
Time difference of 33.0646 secs

> stop.time.parallel.2 - start.time.parallel.2
Time difference of 2.49616 mins

> stop.time.parallel.16 - start.time.parallel.16
Time difference of 2.533662 mins

I would have expected things to be faster in parallel. Is that unreasonable for this example? If so, when should I expect performance improvements in parallel?

Looking at the terminal, I do seem to be using 2 (and 16) threads/processes (apologies if my terminology is incorrect).
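
For completeness, this is roughly how I check things from inside R as well -- a small sketch using parallel::detectCores() and parallelMap's parallelGetOptions(), nothing beyond the obvious:

library(parallelMap)

parallel::detectCores()   # how many cores R can see on this machine

parallelStart(mode = "multicore", cpus = 2, level = "mlr.tuneParams")
parallelGetOptions()      # reports the mode, cpus and level currently in effect
parallelStop()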

Thanks so much for any further input.

  • Did you check whether your code is actually using all 72 cores? The code you've posted uses only 8 cores, so you can't expect a speedup moving to more cores. It sounds like this is a KNL machine; keep in mind that the clock speed of each core is a fraction of the clock speed on your Windows computer, so everything will take much longer. – Lars Kotthoff May 04 '19 at 00:46

1 Answer


This question is more about guessing what's wrong in your setup than providing a "real" answer. Maybe you could also change the title, as you did not get "unexpected results".

Some points:

  • nthread = 1 is already the default for xgboost in mlr
  • multicore is the preferred mode on UNIX systems
  • If your local machine is faster than your server, then either your calculations finish very quickly and the CPU frequencies of the two machines differ substantially, or you should think about parallelizing at a different level than mlr.tuneParams (see here for more information; a rough sketch follows below)
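
A rough sketch of what I mean by switching the level (assuming the rest of your setup stays the same); parallelGetRegisteredLevels() shows which levels mlr registers:

library(mlr)
library(parallelMap)

# lists the registered levels, e.g. mlr.tuneParams, mlr.resample, mlr.benchmark
parallelGetRegisteredLevels()

# parallelize the resampling of each configuration instead of the tuning
# iterations themselves
parallelStart(mode = "multicore", cpus = 8, level = "mlr.resample")
# tuneParams(...) as before
parallelStop()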

Edit

Everything's fine on my machine. It looks like a local problem on your side.

library(mlr)
#> Loading required package: ParamHelpers
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
library(parallelMap)

numeric_ps = makeParamSet(
  makeNumericParam("C", lower = 0.5, upper = 2.0),
  makeNumericParam("sigma", lower = 0.5, upper = 2.0)
)
ctrl = makeTuneControlRandom(maxit=1024L)
rdesc = makeResampleDesc("CV", iters = 3L)

#In serial
start.time.serial <- Sys.time()
res.serial = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
  par.set = numeric_ps, control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#>          Type len Def   Constr Req Tunable Trafo
#> C     numeric   -   - 0.5 to 2   -    TRUE     -
#> sigma numeric   -   - 0.5 to 2   -    TRUE     -
#> With control class: TuneControlRandom
#> Imputation value: 1
stop.time.serial <- Sys.time()
stop.time.serial - start.time.serial
#> Time difference of 31.28781 secs


#In parallel with 2 CPUs
start.time.parallel.2 <- Sys.time()
parallelStart(mode="multicore", cpu=2, level="mlr.tuneParams")
#> Starting parallelization in mode=multicore with cpus=2.
res.parallel.2 = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
  par.set = numeric_ps, control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#>          Type len Def   Constr Req Tunable Trafo
#> C     numeric   -   - 0.5 to 2   -    TRUE     -
#> sigma numeric   -   - 0.5 to 2   -    TRUE     -
#> With control class: TuneControlRandom
#> Imputation value: 1
#> Mapping in parallel: mode = multicore; level = mlr.tuneParams; cpus = 2; elements = 1024.
#> [Tune] Result: C=1.12; sigma=0.647 : mmce.test.mean=0.0466667
parallelStop()
#> Stopped parallelization. All cleaned up.
stop.time.parallel.2 <- Sys.time()
stop.time.parallel.2 - start.time.parallel.2
#> Time difference of 16.13145 secs


#In parallel with 4 CPUs
start.time.parallel.4 <- Sys.time()
parallelStart(mode="multicore", cpu=4, level="mlr.tuneParams")
#> Starting parallelization in mode=multicore with cpus=4.
res.parallel.4 = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
  par.set = numeric_ps, control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#>          Type len Def   Constr Req Tunable Trafo
#> C     numeric   -   - 0.5 to 2   -    TRUE     -
#> sigma numeric   -   - 0.5 to 2   -    TRUE     -
#> With control class: TuneControlRandom
#> Imputation value: 1
#> Mapping in parallel: mode = multicore; level = mlr.tuneParams; cpus = 4; elements = 1024.
#> [Tune] Result: C=0.564; sigma=0.5 : mmce.test.mean=0.0333333
parallelStop()
#> Stopped parallelization. All cleaned up.
stop.time.parallel.4 <- Sys.time()
stop.time.parallel.4 - start.time.parallel.4
#> Time difference of 10.14408 secs

Created on 2019-06-14 by the reprex package (v0.3.0)

  • Pat, thanks for this reply. I changed the title from unexpected results to unexpected performance -- is that what you were getting at? I'm revisiting this problem and am going to update my question with a very small example with iris.task and performance results from my system. I find them unexpected -- would you be able to provide input about whether I need to revise my expectations or my setup? Thank you! – PBB Jun 14 '19 at 01:39
  • 1
  • Thanks. As you can see from my reprex, everything's fine on the mlr side. This has to be a problem of your local machine. – pat-s Jun 14 '19 at 05:55
  • Fascinating. Thank you so, so, so much for taking the time to run that on your machine -- I had been assuming there was something wrong with my implementation (code) that was suboptimal. I will take this performance comparison to IT as server details are entirely outside of my wheelhouse. – PBB Jun 14 '19 at 18:18
  • 1
  • Did you ever find out what was causing the slow performance in your case? I am experiencing the same problem. Tuning an XGBoost learner on a Windows single-core machine runs in less than an hour for 25 iterations x 20 different tuning settings. When I run the same code on a Linux machine with 12 cores using parallelMap, it takes so long that I give up and kill it (e.g. ~36 hours). What can be causing this? All packages are up to date. – panda Aug 26 '19 at 06:34
  • I'm also interested in the outcome. I've encountered the same problem: parallelMap speeds things up on Windows but slows things down on my Linux EC2 instance. – Andrew Royal Oct 02 '19 at 11:49
  • Have you tried using `mode = "socket"`? Forking might interfere with openmp parallelization from xgboost. I cannot reproduce the issues so it is hard to debug in more detail. – pat-s Oct 02 '19 at 14:05
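
A minimal sketch of that last suggestion, in case it helps anyone landing here (untested on the setups reported above): socket workers are fresh R sessions rather than forks, so they avoid the fork/OpenMP interaction mentioned in the comment.

library(parallelMap)

parallelStart(mode = "socket", cpus = 8, level = "mlr.tuneParams")
# tuneParams(...) as before; the objects needed on the workers are
# exported automatically by mlr/parallelMap, at some serialization cost
parallelStop()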