
I am queueing and running an R script on an HPC cluster via sbatch and mpirun; the script is meant to run foreach loops in parallel. To do this I've drawn on several useful Stack Overflow questions and answers: R Running foreach dopar loop on HPC MPIcluster, Single R script on multiple nodes, and Slurm: Use cores from multiple nodes for R parallelization.

It seems that the script completes, but a couple of strange things happen. The most important is that the Slurm job keeps running afterwards, apparently doing nothing. I'd like to understand whether I'm doing things properly. I'll first give some specific information, then describe the strange things I'm seeing, and then ask my questions.

– Information:

  • R is loaded as a module, which also loads an OpenMPI module. The packages Rmpi, doParallel, snow, and foreach are already compiled and included in the module.

  • The cluster has nodes with 20 CPUs each. My sbatch file requests 2 nodes and 20 CPUs per node.

  • The R script myscript.R is called in the sbatch file like this:

mpirun -np 1 Rscript -e "source('myscript.R')"
  • My script loads several libraries in this order:
library('snow')
library('Rmpi')
library('doParallel')
library('foreach')

and then sets up parallelization at the beginning as follows:

workers <- mpi.universe.size() - 1                     # one MPI slot is kept by the master
cl <- makeMPIcluster(workers, outfile='', type='MPI')  # spawn the snow/Rmpi worker processes
registerDoParallel(cl)                                  # register them as the %dopar% backend

Several foreach–%dopar% loops are then run in succession – that is, each starts only after the previous one has finished. Finally

stopCluster(cl)
mpi.quit()

are called at the very end of the script. (A minimal sketch of this whole pattern is given after this list.)

  • mpi.universe.size() correctly gives 40, as expected. Also, getDoParName() gives doParallelSNOW. The Slurm log encouragingly says

    39 slaves are spawned successfully. 0 failed.
    starting MPI worker
    starting MPI worker
    ...

Also, calling print(clusterCall(cl, function() Sys.info()[c("nodename","machine")])) from within the script correctly reports the node names shown in the Slurm queue.
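
For reference, here is a minimal, self-contained sketch of the whole pattern described above. The loop bodies, iteration counts and .combine choices are placeholders for illustration, not the actual contents of myscript.R:

library('snow')
library('Rmpi')
library('doParallel')
library('foreach')

workers <- mpi.universe.size() - 1                       # one MPI slot stays with the master
cl <- makeMPIcluster(workers, outfile='', type='MPI')    # spawn the snow/Rmpi workers
registerDoParallel(cl)                                    # make them the %dopar% backend

# diagnostic: report where each worker is running
print(clusterCall(cl, function() Sys.info()[c("nodename","machine")]))

# placeholder %dopar% loops, run strictly one after the other
res1 <- foreach(i = 1:100, .combine = c) %dopar% {
    sqrt(i)            # stand-in for the real per-iteration work
}
res2 <- foreach(i = 1:100, .combine = rbind) %dopar% {
    c(i, i^2)          # another stand-in computation
}

stopCluster(cl)        # shut the workers down
mpi.quit()             # terminate MPI and exit R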

– What's strange:

  • The R script completes all its operations, the last of which saves a plot as a PDF; the file does appear and is correct. But the Slurm job doesn't end: it remains in the queue indefinitely with status "running".

  • The Slurm log shows a great many lines containing Type: EXEC. I can't find any relation between their number and the number of foreach loops called. At the very end the log shows 19 lines with Type: DONE (which makes sense to me).

– My questions:

  • Why does the Slurm job keep running indefinitely after the script has finished?
  • Why the numerous Type: EXEC messages? Are they normal?
  • There is some masking between the snow and doParallel packages. Am I loading the right packages, and in the right order?
  • Some answers to the Stack Overflow questions mentioned above recommend calling the script with
mpirun -np 1 R --slave -f 'myscript.R'

instead of using Rscript as I did. What's the difference? Note, though, that the problems I mentioned remain even when I call the script this way.

I thank you very much for your help!

pglpm
  • I am having the very same problem, at least with regard to the script running indefinitely. I believe it has to do with the `MPI_Comm_disconnect` command hanging. If you search for it you'll find several related issues. The only workaround I've come up with is to include a `stop()` call at the end of the script. – Johan Larsson Dec 03 '19 at 11:38
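
(A minimal sketch of that workaround, assuming the hang really does happen in the MPI teardown; the stop() message below is only a placeholder, not part of the original script:)

stopCluster(cl)
stop('exiting here on purpose; mpi.quit() appears to hang in MPI_Comm_disconnect')  # workaround from the comment above
mpi.quit()   # never reached once stop() is used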
