
I am trying to run a reproducible example with the mlr R package in parallel, for which I found the suggestion to use parallelStartMulticore (link). The project also uses packrat.

The code runs properly on workstations and small servers, but running it on an HPC with the Torque batch system leads to memory exhaustion. It seems that R processes are spawned ad infinitum, unlike on regular Linux machines. I have tried switching to parallelStartSocket, which works fine, but then I cannot reproduce the results with RNG seeds.
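
For reference, here is a stripped-down illustration of the RNG behaviour I mean, using base parallel directly rather than the mlr/parallelMap pipeline below, so the calls only approximate what the two backends do internally:

# Fork-based backend: L'Ecuyer-CMRG streams seeded in the parent give the same
# numbers every time this pair of lines is run
set.seed(1, kind = "L'Ecuyer-CMRG")
parallel::mclapply(1:2, function(i) runif(1), mc.cores = 2)

# Socket-based backend: the workers are fresh R sessions, so the parent's seed
# does not reach them; clusterSetRNGStream() would be needed for reproducibility
cl <- parallel::makePSOCKcluster(2)
parallel::clusterSetRNGStream(cl, iseed = 1)
parallel::parLapply(cl, 1:2, function(i) runif(1))
parallel::stopCluster(cl)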

Here is a minimal example:

library(mlr)
library(parallelMap)
M <- data.frame(x = runif(1e2), y = as.factor(rnorm(1e2) > 0))

# Example with random forest 
parallelStartMulticore(parallel::detectCores())
plyr::l_ply(
  seq(100), 
  function(x) {
    message("Iteration number: ", x)

    set.seed(1, "L'Ecuyer")
    tsk <- makeClassifTask(data = M, target = "y")

    num_ps <- makeParamSet(
      makeIntegerParam("ntree", lower = 10, upper = 50), 
      makeIntegerParam("nodesize", lower = 1, upper = 5)
    )
    ctrl <- makeTuneControlGrid(resolution = 2L, tune.threshold = TRUE)

    # define learner
    lrn <- makeLearner("classif.randomForest", predict.type = "prob")
    rdesc <- makeResampleDesc("CV", iters = 2L, stratify = TRUE)

    # Grid search in parallel
    res <- tuneParams(
      lrn, task = tsk, resampling = rdesc, par.set = num_ps, 
      measures = list(auc), control = ctrl)

    # Fit optimal params
    lrn.optim <- setHyperPars(lrn, par.vals = res$x)
    m <- train(lrn.optim, tsk)

    # Predict on the same data (no separate test set in this minimal example)
    pred_rf <- predict(m, newdata = M)

    pred_rf
  }
)
parallelStop()

The HPC hardware is an HP Apollo 6000 System with ProLiant XL230a Gen9 server blades (64-bit), with Intel Xeon E5-2683 processors. I do not know whether the issue comes from the Torque batch system, the hardware, or some flaw in the above code. The sessionInfo() of the HPC:

R version 3.4.0 (2017-04-21)                                                                                                                                                       
Platform: x86_64-pc-linux-gnu (64-bit)                                                                                                                                             
Running under: CentOS Linux 7 (Core)                                                                                                                                               

Matrix products: default                                                                                                                                                           
BLAS/LAPACK: /cm/shared/apps/intel/parallel_studio_xe/2017/compilers_and_libraries_2017.0.098/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so                                          

locale:                                                                                                                                                                            
[1] C                                                                                                                                                                              

attached base packages:                                                                                                                                                            
[1] stats     graphics  grDevices utils     datasets  methods   base                                                                                                               

other attached packages:                                                                                                                                                           
[1] parallelMap_1.3   mlr_2.11          ParamHelpers_1.10 RLinuxModules_0.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14        splines_3.4.0       munsell_0.4.3      
 [4] colorspace_1.3-2    lattice_0.20-35     rlang_0.1.1        
 [7] plyr_1.8.4          tools_3.4.0         parallel_3.4.0     
[10] grid_3.4.0          packrat_0.4.8-1     checkmate_1.8.2    
[13] data.table_1.10.4   gtable_0.2.0        randomForest_4.6-12
[16] survival_2.41-3     lazyeval_0.2.0      tibble_1.3.1       
[19] Matrix_1.2-12       ggplot2_2.2.1       stringi_1.1.5      
[22] compiler_3.4.0      BBmisc_1.11         scales_0.4.1       
[25] backports_1.0.5  
AnonQuest
  • I can run your example without problems. How exactly does the memory problem manifest itself? It sounds like something within R is running out of memory (as it works in socket mode). What learners are you using, in particular any that require RWeka? The only other thing I can think of off the top of my head is that you're doing so many evaluations that the history takes up all available memory. How soon does the error occur? – Lars Kotthoff Jan 25 '18 at 17:27
  • Hi Lars, it seems that the main process keeps spawning R processes for parallel calculations that are not cleaned up. But I cannot reproduce this on a regular workstation. I am using `classif.randomForest` and `classif.ksvm`. This sample script can eat 30GB of memory in under 5 minutes. The memory is not even freed when the script finishes, it has to be manually killed. – AnonQuest Jan 26 '18 at 08:11
  • That sounds like a bug in R. Are you running the latest version? – Lars Kotthoff Jan 26 '18 at 16:20
  • Thanks for the suggestion. We installed `R 3.4.3`, which is the version in my local machine, and the problem persists. Any ideas? – AnonQuest Jan 29 '18 at 14:44
  • Could you try to narrow it down? E.g. does it happen with other learners as well, without tuning, with different resampling strategies? – Lars Kotthoff Jan 29 '18 at 16:18
  • This did happen with other learners; we have not tried other resampling strategies. But the root of this seems to be the calls to `mclapply`, which is indeed a broader issue. `mlr` is not to blame; should I edit the title? – AnonQuest Feb 01 '18 at 09:48

1 Answer


The "multicore" parallelMap backend uses parallel::mcmapply which should create a new fork()ed child process for every evaluation inside tuneParams and then quickly kill that process. Depending on what you use to count memory usage / active processes, it is possible that memory gets mis-reported and that child processes that are already dead (and were only alive for the fraction of a second) are shown, or that killing of finished processes for some reason does not happen.

Possible problems:

  • The batch system does not correctly track memory usage and counts the parent process's memory for every child separately. Does /usr/bin/free actually report that 30GB are gone while the script is running? As an easier test case, consider (running in an empty R session)

    xxx <- 1:1e9
    parallel::mclapply(1:4, function(x) {
      Sys.sleep(60)
    }, mc.cores = 4)
    

    which should use about 4 GB of memory. If, during the 60 seconds that the child processes are sleeping, the reported memory usage is about 16 GB, it is this problem.

  • Memory reporting is accurate, but for some reason many memory pages are modified inside the child processes (triggering lots of copy-on-write copies), e.g. because of garbage collection. Does calling gc() before the tuneParams() call help? (A small sketch follows after this list.)

  • Some setting on the machine prevents the "parallel" package from killing child processes. The following:

    parallel::mclapply(1:4, function(x) {
      xxx <<- 1:1e9 ; NULL
    }, mc.cores = 4)
    Sys.sleep(60)
    

    should grab about 16 GB of memory but release it right away. If the memory remains in use during the Sys.sleep (and for the rest of the R session), it might be this problem.
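
For the second point, a minimal sketch (reusing the objects from the question) would be to force a collection in the parent right before the forking call:

# Collect in the parent so the children are less likely to run a full gc
# themselves, which would touch many pages and force copy-on-write copies
gc()
res <- tuneParams(
  lrn, task = tsk, resampling = rdesc, par.set = num_ps,
  measures = list(auc), control = ctrl)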
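
To test the third point directly, one can also list this R session's child processes right after the snippet above returns; the ps options below are an assumption (GNU ps on Linux) and may need adjusting:

parallel::mclapply(1:4, function(x) { xxx <<- 1:1e9; NULL }, mc.cores = 4)
# By now the forked children should be gone again. List everything whose parent
# is this R session (the ps call itself shows up; any additional R processes
# would be the leftover children described above):
system(sprintf("ps --ppid %d -o pid,stat,rss,cmd", Sys.getpid()))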

mb706
  • Thanks for the answer. The problem is indeed with the snippet of code from your third bullet. The `parallel` package does not clean up the processes; they stick around during `Sys.sleep`, after it, and even after closing the `R` session. Calling `gc()` does not help. We are looking into the configuration of the HPC to see if we can further narrow it down and fix it. – AnonQuest Feb 01 '18 at 09:42
  • I am accepting this answer for being useful in narrowing down the issue. If we find out the exact problem I will post it as a comment. – AnonQuest Feb 15 '18 at 10:24