I am running an mlr benchmark with about 12 learners. My code runs without any problem when I do not use parallelMap, but as soon as I add parallelization it crashes silently, always at the same point, even with only 2 cores.
I suspected it was running out of memory, so I restructured my code to use as little memory as possible by:
- Setting a seed and then calling benchmark on only one learner at a time
- Placing the benchmark calls in nested functions so that the memory used by the object returned from the benchmark is freed up.
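For illustration, the pattern in the two bullets above can be sketched roughly as follows. This is a minimal base-R sketch under stated assumptions, not the actual benchmark code: `run_one`, `fit_fun`, `out_dir` and `jobs` are hypothetical names, and each `fit_fun` stands in for a closure that would call benchmark() on a single learner.

```r
# Hypothetical helper (illustration only): run one job inside its own
# function scope, write the result to disk, and return just the file path.
run_one <- function(id, fit_fun, out_dir = tempdir()) {
  set.seed(24601, "L'Ecuyer")   # same seed before every job, as in the post
  res <- fit_fun()              # stand-in for benchmark(<one learner>, ...)
  path <- file.path(out_dir, paste0(id, ".rds"))
  saveRDS(res, path)
  path                          # `res` goes out of scope when the function returns
}

# Placeholder jobs; real ones would each wrap a single benchmark() call.
jobs <- list(a = function() rnorm(3), b = function() rnorm(3))

paths <- character(0)
for (id in names(jobs)) {
  paths[id] <- run_one(id, jobs[[id]])
  invisible(gc())               # ask R to reclaim memory that is no longer referenced
}
```

Because only the file paths are kept in the calling environment, each result can be read back later with readRDS() when it is needed.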
However, this has not helped. It always crashes during tuning of a random forest variable importance filter, but it successfully benchmarks a random forest prior to that. Here are the relevant code snippets:
parallelStart(mode="multicore", cpus=2, level="mlr.resample", show.info = TRUE, logging=TRUE, storagedir='/home/annette/Experiments/Logs_new')
set.seed(24601, "L'Ecuyer")
cox.filt.rsfrc.lrn = makeTuneWrapper(
  makeFilterWrapper(
    makeLearner(cl=base_learner, id = "cox.filt.rfsrc", predict.type="response"),
    fw.method="randomForestSRC_importance",
    cache=TRUE
  ),
  resampling = inner,
  par.set = makeParamSet(makeNumericParam("fw.perc", lower=0.01, upper=0.5)),
  control = makeTuneControlRandom(maxit=20),
  show.info = TRUE)
bmr = benchmark(cox.filt.rsfrc.lrn, surv.task, outer, surv.measures, show.info = TRUE, models=TRUE, keep.extract=FALSE)
There is nothing unusual in the logs. The last call to gc() before this returns:
Garbage collection 55 = 23+3+29 (level 2) ...
110.3 Mbytes of cons cells used (61%)
27.2 Mbytes of vectors used (14%)
If I run just the random forest variable importance filter on its own with parallelization, it succeeds. I have also tried adding the following to .Renviron, but it did not help:
R_NSIZE = 100M
R_VSIZE = 50M
Can anyone suggest how I can solve this problem, or at least how I can find out more information about what is going wrong?
EDIT:
Thanks to comments by @pat-s below, I realised that the R processes had not crashed but were sitting idle. When I killed them all, an error message was written to the output files, the same one each time:
Mapping in parallel: mode = multicore; level = mlr.resample; cpus = 12; elements = 4.
Error in extractSubList(iter.results, "measures.train", simplify = "rows") :
Assertion on 'xs' failed: Must be of type 'list', not 'NULL'.
Calls: apply ... extractSubList -> assertList -> makeAssertion -> mstop
Execution halted