I have an sbatch script that looks like this (each node has 128 cores):
#!/bin/tcsh
#SBATCH --nodes=5
srun -n 1 -c 1 ./exec opt1
srun -n 1 -c 1 ./exec opt2
srun -n 1 -c 1 ./exec opt3
srun -n 1 -c 1 ./exec opt4
srun -n 1 -c 1 ./exec opt5
srun -n 1 -c 1 ./exec opt6
srun -n 1 -c 1 ./exec opt7
srun -n 1 -c 1 ./exec opt8
srun -n 1 -c 1 ./exec opt9
srun -n 1 -c 1 ./exec opt10
srun -n 640 ./program.x
The first 10 srun commands all run sequentially and, once they have finished, my larger program executes. Those first 10 srun commands could all run at the same time; the script would then wait for all of them to finish before executing the final, larger srun. However, my nodes are set up to be exclusive, so right now I could probably have 5 going at a time. It would be much more efficient to have all 10 going at once in a non-exclusive manner, because they do not depend on each other. I also don't know which one will take the longest, and that will change based on several factors on the cluster.
What srun options do I need to use to get all of my first 10 srun commands to run simultaneously, and then have the script wait until they are all complete before starting the final one?
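For reference, the behavior described above (start the 10 small steps together, wait for all of them, then start the large run) is commonly expressed with the shell's background operator and the wait builtin. The following is only a sketch under several assumptions: it is written for a POSIX shell rather than tcsh (tcsh also has & and wait), the --exclusive step option is a guess at what this cluster needs to place several steps on one node and should be checked against man srun for the installed Slurm version, and the srun stand-in function and the launched counter are hypothetical additions so the control flow can be tried outside a Slurm allocation.

```shell
#!/bin/sh
#SBATCH --nodes=5

# Assumption: when srun is not installed (e.g. trying this locally), a
# stand-in function prints the command so the control flow can be checked.
if ! command -v srun >/dev/null 2>&1; then
    srun() { echo "srun $*"; }
fi

launched=0
for i in 1 2 3 4 5 6 7 8 9 10; do
    # The trailing & backgrounds the step, so the loop continues at once.
    # --exclusive as a step option asks Slurm to give each step its own CPU;
    # this flag choice is an assumption, verify it for your Slurm version.
    srun -n 1 -c 1 --exclusive ./exec "opt$i" &
    launched=$((launched + 1))
done

# Block until every backgrounded step has exited, whichever finishes last.
wait

# Only now start the large run across the whole allocation.
srun -n 640 ./program.x
```

Because wait with no arguments returns only after all background jobs have exited, it does not matter which of the 10 steps takes the longest.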