
I am running an mlr benchmark with about 12 learners. My code runs without any problem when I do not use parallelMap, but as soon as I add parallelization it crashes silently, always at the same point, even with only 2 cores.

I thought it must be running out of memory, so I have re-structured my code to use as little memory as possible by

  1. Setting a seed and then calling benchmark on only one learner at a time
  2. Placing the benchmark calls in nested functions so that the memory used by the object returned from the benchmark is freed up.
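A minimal sketch of that structure, for illustration only. The helper `run_and_save` and its arguments are hypothetical names, and `run_fn` stands in for the real `benchmark(...)` call; the sketch assumes that writing each result to disk and returning only the file path is acceptable.

```r
# Hypothetical helper: run one benchmark, persist the result, return only a path.
# `run_fn` is a stand-in for a call such as benchmark(lrn, surv.task, outer, ...).
run_and_save <- function(id, run_fn, out_dir, seed = 24601) {
  set.seed(seed, "L'Ecuyer")            # same seed for every learner
  res <- run_fn()                       # the large object exists only in this frame
  out_file <- file.path(out_dir, paste0(id, ".rds"))
  saveRDS(res, out_file)                # keep the result on disk, not in memory
  rm(res)
  gc()                                  # ask R to release the memory now
  out_file                              # only the small path string is returned
}

# Usage with a toy job in place of a real benchmark
out_dir <- tempdir()
f <- run_and_save("demo", function() rnorm(10), out_dir)
```

Because only the path leaves the function, nothing in the calling environment keeps a reference to the benchmark result between learners.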

However, this has not helped. It always crashes during tuning of a random forest variable importance filter, but it successfully benchmarks a random forest prior to that. Here are the relevant code snippets:

    parallelStart(mode = "multicore", cpus = 2, level = "mlr.resample", show.info = TRUE, logging = TRUE, storagedir = '/home/annette/Experiments/Logs_new')


    set.seed(24601, "L'Ecuyer")
    cox.filt.rsfrc.lrn = makeTuneWrapper(
                          makeFilterWrapper(
                                makeLearner(cl=base_learner, id = "cox.filt.rfsrc", predict.type="response"), 
                                fw.method="randomForestSRC_importance",
                                cache=TRUE
                          ), 
                          resampling = inner, 
                          par.set = makeParamSet(makeNumericParam("fw.perc", lower=0.01, upper=0.5)), 
                          control = makeTuneControlRandom(maxit=20),
                          show.info = TRUE)
    bmr = benchmark(cox.filt.rsfrc.lrn, surv.task, outer, surv.measures, show.info = TRUE, models=TRUE, keep.extract=FALSE)

There is nothing unusual in the logs. The last call to gc() before this returns:

    Garbage collection 55 = 23+3+29 (level 2) ... 
    110.3 Mbytes of cons cells used (61%)
    27.2 Mbytes of vectors used (14%)

If I run just the RF varimp filter on its own with parallelization it succeeds. I have tried adding the following to .Renviron but it did not help:

    R_NSIZE = 100M
    R_VSIZE = 50M

Can anyone suggest how I can solve this problem, or at least how I can find out more information about what is going wrong?

EDIT:

Thanks to comments by @pat-s below, I realised that the R processes had not crashed but were sitting idle. So I killed them all off and an error message was written to the output files - the same one each time:

    Mapping in parallel: mode = multicore; level = mlr.resample; cpus = 12; elements = 4.
    Error in extractSubList(iter.results, "measures.train", simplify = "rows") : 
      Assertion on 'xs' failed: Must be of type 'list', not 'NULL'.
    Calls: apply ... extractSubList -> assertList -> makeAssertion -> mstop
    Execution halted
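The assertion says a `NULL` arrived where a list of iteration results was expected. Since this message only appeared after the idle processes were killed, the `NULL` may simply be the result of the kill rather than the original cause of the hang. The sketch below shows, outside mlr, how failed elements can be spotted in the output of `parallel::mclapply`. It assumes a Unix-like system (forking is not available on Windows); `find_failed` and `toy_job` are hypothetical names for illustration, not part of mlr or parallelMap.

```r
library(parallel)

# Hypothetical helper: indices of elements that came back NULL or as a try-error.
find_failed <- function(results) {
  failed <- vapply(results,
                   function(x) is.null(x) || inherits(x, "try-error"),
                   logical(1))
  which(failed)
}

# Toy job standing in for one resampling iteration; iteration 2 fails on purpose.
toy_job <- function(i) {
  if (i == 2) stop("simulated failure in iteration ", i)
  list(iter = i, measure = i^2)
}

# mc.preschedule = FALSE forks a separate child process for each element.
res <- mclapply(1:4, toy_job, mc.cores = 2, mc.preschedule = FALSE)

# Inspect which iterations did not deliver a usable result.
find_failed(res)
```

Running the suspect learner through a check like this, or simply in sequential mode, may show whether an iteration fails outright or never returns.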
panda
  • Do you see anything in the logs? What do you mean exactly by "crash"? You say it only happens after you start the second run of a RF learner, i.e. the first one is fine? I think this should go as an issue to parallelMap on Github. – pat-s Sep 04 '19 at 08:46
  • No, no error messages in the logs. When I say "crash" I mean the Rscript program halts with no message - not Done or Exit, just nothing. I've only run it with 1 learner or all 12. When I run it with all 12 it crashes on #8. I will post an issue on Github. – panda Sep 04 '19 at 09:12
  • Does https://github.com/wlandau/drake-examples/issues/33 sound similar? I suspect something is wrong with _parallelMap_. – pat-s Sep 04 '19 at 09:15
  • Well I certainly have problems with xgboost and parallelMap, to the point where I have had to stop using it. See https://stackoverflow.com/questions/55978153/r-how-to-use-parallelmap-with-mlr-xgboost-on-linux-server-unexpected-perfor - my comment is right at the end. – panda Sep 04 '19 at 09:24
  • Actually, yes, that does sound like my problem. I just did ps -A and found that there were dozens of R processes still alive, but with 0 CPU time. I had been using top, so didn't see them and thought they had died. – panda Sep 04 '19 at 09:39

0 Answers