I have an sbatch script that looks like this (each node has 128 cores):

#!/bin/tcsh
#SBATCH --nodes=5

srun -n 1 -c 1 ./exec opt1
srun -n 1 -c 1 ./exec opt2
srun -n 1 -c 1 ./exec opt3
srun -n 1 -c 1 ./exec opt4
srun -n 1 -c 1 ./exec opt5
srun -n 1 -c 1 ./exec opt6
srun -n 1 -c 1 ./exec opt7
srun -n 1 -c 1 ./exec opt8
srun -n 1 -c 1 ./exec opt9
srun -n 1 -c 1 ./exec opt10

srun -n 640 ./program.x

The first 10 sruns run sequentially and, once they finish, the larger program executes. Ideally, the first 10 sruns would all execute at the same time, the script would wait for them to finish, and then the final larger srun would run. However, my nodes are configured as exclusive, so right now I could probably only have 5 going at a time; it would be much more efficient to have them all going at once in a non-exclusive manner, because they do not depend on each other. I also don't know which will take the longest, and that will change based on several factors on the cluster.

What srun options do I need to use to get all of my sruns to run simultaneously and then wait until they are all complete?

byrdman1982

1 Answer

You can typically do this in a loop by backgrounding the processes and waiting at the end, e.g.:

#!/bin/tcsh
#SBATCH --nodes=5

foreach i ( `seq 1 10` )
    srun -n 1 -c 1 ./exec opt${i} &
end
wait

srun -n 640 ./program.x

You may need to add other options to the single-core srun commands (e.g. --exact) to get them all to run in parallel and/or pin them to different cores, but that depends on your local Slurm and system configuration, so you should consult your local documentation/support on exactly what options are needed.
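As a minimal sketch of what that might look like, assuming a recent Slurm where the step-level --exact flag is available (on older versions the step-level --exclusive flag played a similar role); the --mem-per-cpu value is a placeholder you would tune so the first step does not claim a whole node's memory:

#!/bin/tcsh
#SBATCH --nodes=5

# --exact limits each step to only the CPUs it requests, and an explicit
# per-step memory request keeps one step from reserving all node memory.
# Both the flag choice and the 1G value depend on your site configuration.
foreach i ( `seq 1 10` )
    srun -n 1 -c 1 --exact --mem-per-cpu=1G ./exec opt${i} &
end
wait

srun -n 640 ./program.x

Without something like --exact, a single-task step may be allocated all the resources of a node, which is exactly what prevents the steps from running concurrently.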

AndyT