I am queueing and running an R script on an HPC cluster via sbatch and mpirun; the script is meant to use foreach in parallel. To do this I've drawn on several useful StackOverflow Q&As: "R Running foreach dopar loop on HPC MPIcluster", "Single R script on multiple nodes", and "Slurm: Use cores from multiple nodes for R parallelization".
It seems that the script completes, but a couple of strange things happen. The most important is that the slurm job keeps running afterwards, apparently doing nothing. I'd like to understand whether I'm doing things properly. I'll first give some more specific information, then describe the strange behaviour, then ask my questions.
– Information:
R is loaded as a module, which also loads an OpenMPI module. The packages Rmpi, doParallel, snow, and foreach were already compiled and included in the module. The cluster has nodes with 20 CPUs each. My sbatch file books 2 nodes and 20 CPUs per node. The R script myscript.R is called in the sbatch file like this:
mpirun -np 1 Rscript -e "source('myscript.R')"
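For reference, a minimal sbatch file matching this description might look like the following sketch. The job name, output path, and module name are hypothetical placeholders; only the node/CPU geometry and the mpirun line come from my actual setup:

```shell
#!/bin/bash
#SBATCH --job-name=myscript        # hypothetical job name
#SBATCH --nodes=2                  # book 2 nodes...
#SBATCH --ntasks-per-node=20       # ...with 20 CPUs each, so mpi.universe.size() == 40
#SBATCH --output=slurm-%j.out      # hypothetical log path

module load R                      # placeholder module name; loading R also pulls in OpenMPI

# Launch a single R process; Rmpi then spawns the remaining 39 workers itself
mpirun -np 1 Rscript -e "source('myscript.R')"
```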
- My script loads several libraries in this order:
library('snow')
library('Rmpi')
library('doParallel')
library('foreach')
and then sets up parallelization as follows at the beginning:
workers <- mpi.universe.size() - 1
cl <- makeMPIcluster(workers, outfile='', type='MPI')
registerDoParallel(cl)
Several foreach ... %dopar% loops are then run in succession – that is, each starts after the previous one has finished. Finally,
stopCluster(cl)
mpi.quit()
are called at the very end of the script.
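Putting the pieces together, the overall structure of myscript.R is as follows. This is only a condensed sketch of what I described above: the loop bounds and body are placeholders, not my real computation, and it needs the MPI/SLURM environment to actually run:

```r
library('snow')
library('Rmpi')
library('doParallel')
library('foreach')

# mpirun launched a single R process; the rest of the MPI universe becomes workers
workers <- mpi.universe.size() - 1           # 39 on my 2x20-CPU allocation
cl <- makeMPIcluster(workers, outfile = '', type = 'MPI')
registerDoParallel(cl)

# Several loops like this run one after another (placeholder body)
res <- foreach(i = 1:100, .combine = c) %dopar% {
  sqrt(i)
}

# ... more %dopar% loops, saving a plot as PDF, etc. ...

stopCluster(cl)
mpi.quit()
```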
mpi.universe.size()
correctly gives 40, as expected. Also, getDoParWorkers() gives 39 and getDoParName() gives doParallelSNOW. The slurm log encouragingly says:
39 slaves are spawned successfully. 0 failed.
starting MPI worker
starting MPI worker
...
Also, calling print(clusterCall(cl, function() Sys.info()[c("nodename","machine")]))
from within the script correctly reports the node names shown in the slurm queue.
– What's strange:
The R script completes all its operations, the last one being saving a plot as a PDF, which I do see and which is correct. But the slurm job doesn't end: it remains in the queue indefinitely with status "running".
The slurm log shows a great many lines reading
Type: EXEC
and I can't find any relation between their number and the number of foreach loops called. At the very end the log shows 19 lines reading
Type: DONE
(which make sense to me).
– My questions:
- Why does the slurm job keep running indefinitely after the script has finished?
- Why the numerous
Type: EXEC
messages? Are they normal?
- There is some masking between the packages snow and doParallel. Am I loading the right packages, and in the right order?
- Some answers to the StackOverflow questions mentioned above recommend calling the script with
mpirun -np 1 R --slave -f 'myscript.R'
instead of using Rscript as I did. What's the difference? Note, though, that the problems I mentioned remain even if I call the script this way.
Thank you very much for your help!