I have a (small) list of n scripts that I need to submit to Slurm on Linux. Each script does some work and then writes output to a file. The work portion of each script executes much faster when I request 32 cores than when I request 16 or (worse) 8 cores; however, the wait time for scheduling is usually highest for 32 cores, then 16, then 8. Depending on conditions outside of my control that influence wait times, requesting 32 cores may result in the lowest total time, or it may not.
My solution has been to submit n*3 jobs, one for each script and each number of processors in {32, 16, 8}. For each script, I only need one process to finish and I don't care which it is. Up to now, I manually check each process's output for evidence of having finished and then manually cancel the other two processes running the same script. I would like to automate this.
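For context, the manual check-and-cancel step looks roughly like this (the "finished" marker and the job IDs are just placeholders for what I look for and cancel by hand):

# check whether, e.g., the 32-core run of script 4 has written its final output
grep "finished" my_script_4_32.out

# if it has, look up and cancel the 16- and 8-core runs of the same script
squeue -u $USER
scancel 1234567 1234568    # placeholder job IDs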
How can I simultaneously run n groups of processes, wait for the first process in each group to finish (at which point, the other processes in the group should be canceled), and wait for this to occur for all of the groups before moving on to further commands in the script?
My current code to submit the jobs is:
for i in {1..9}; do
    for p in 32 16 8; do
        srun -t 3:00:00 -N 1 -n 1 -c $p --mem=50g python my_script_$i.py $p > my_script_${i}_${p}.out &
    done
done
wait
I've looked into the wait command, but I'm not sure how to wait for any process (as opposed to all processes or a particular one) to finish.
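Something like the sketch below is roughly what I have in mind, but I'm not sure it's correct: as far as I can tell, wait -n (bash 4.3+) returns when any one background job finishes, and I'm assuming that killing the leftover backgrounded srun processes is enough to cancel their jobs.

for i in {1..9}; do
    (
        pids=()
        for p in 32 16 8; do
            srun -t 3:00:00 -N 1 -n 1 -c $p --mem=50g \
                python my_script_$i.py $p > my_script_${i}_${p}.out &
            pids+=($!)
        done
        wait -n                        # wait for the first of the three runs to finish
        kill "${pids[@]}" 2>/dev/null  # assumption: killing srun cancels the remaining runs
    ) &
done
wait    # wait for all n groups to finish before moving on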
I am also open to the possibility that there are better ways to submit these jobs to Slurm than in a loop using srun; I am a Slurm beginner.
Edit: https://stackoverflow.com/a/41613532/10499953 might be relevant, but I'm not sure how to make that work in parallel.