
I have a (small) list of n scripts that I need to submit to slurm on linux. Each script does some work and then writes output to a file. The work portion of each script executes much faster when I request 32 cores than when I request 16 or (worse) 8 cores; however, the wait time for scheduling is usually highest for 32 cores, then 16, then 8. Depending on conditions outside of my control that influence wait times, requesting 32 cores may result in the lowest total time, or it may not.

My solution has been to submit 3n jobs: one for each combination of script and core count in {32, 16, 8}. For each script, I only need one job to finish and I don't care which it is. Until now, I have manually checked each job's output for evidence of having finished and then manually cancelled the other two jobs running the same script. I would like to automate this.

How can I simultaneously run n groups of processes, wait for the first process in each group to finish (at which point, the other processes in the group should be canceled), and wait for this to occur for all of the groups before moving on to further commands in the script?

My current code to submit the jobs is:

for i in {1..9}; do
  for p in 32 16 8; do
    srun -t 3:00:00 -N 1 -n 1 -c $p --mem=50g python my_script_$i.py $p > my_script_${i}_${p}.out &
  done
done
wait

I've looked into the wait command, but I'm not sure how to wait for any process (as opposed to all processes or a particular one) to finish.

I am also open to the possibility that there are better ways to submit these jobs to slurm than a loop of srun calls; I am a slurm beginner.
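(For reference, one sbatch-based pattern that avoids keeping a login-shell process alive per job: submit the three sizes with `sbatch --parsable` to collect job IDs, then submit a tiny "reaper" job that becomes eligible as soon as any one of them ends and cancels the whole group. This is an untested sketch; the `?` OR-separator for `--dependency` is an assumption about the cluster's Slurm version, and `scancel` of an already-finished job is a harmless no-op.)

```shell
for i in {1..9}; do
  ids=()
  for p in 32 16 8; do
    # --parsable makes sbatch print just the job ID
    ids+=("$(sbatch --parsable -t 3:00:00 -N 1 -n 1 -c "$p" --mem=50g \
          --wrap "python my_script_${i}.py $p > my_script_${i}_${p}.out")")
  done
  # Reaper: eligible when ANY of the three ends ('?' = OR between dependencies,
  # an assumption about the installed Slurm version), then cancels all three.
  sbatch --dependency="afterany:${ids[0]}?afterany:${ids[1]}?afterany:${ids[2]}" \
         --wrap "scancel ${ids[*]}"
done
```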

Edit: https://stackoverflow.com/a/41613532/10499953 might be relevant but I'm not sure how to make that work in parallel.

Attila the Fun
    `bash` 4.3 introduced a `-n` option to `wait` that will block until *any one* background job completes. – chepner Jul 16 '20 at 15:14

1 Answer


Run each group in a subshell (in the background), so that `wait -n` can block until a job in that group completes. Since the subshell's only background jobs are that group's three runs, a bare `wait -n` (bash 4.3+) is enough.

for i in {1..9}; do
  ( jobs=()            # PIDs of this group's three runs
    for p in 32 16 8; do
      srun ... & jobs+=($!)
    done
    wait -n                        # wait for any one job; only this group's 3 jobs exist in the subshell
    kill "${jobs[@]}" 2>/dev/null  # kill the other two; the finished job is already gone
  ) &
done

wait  # Wait for each of the 9 groups to complete
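A minimal, self-contained demonstration of the pattern, with `sleep` standing in for `srun` (the durations are made up; in the real submission they correspond to the 8/16/32-core variants of one script). The shortest job "wins" each group and the others are killed:

```shell
#!/usr/bin/env bash
# Two groups of three fake jobs each; the 1-second job wins every group.
results=$(
  for i in 1 2; do
    (
      pids=()
      for d in 3 2 1; do
        { sleep "$d" && echo "group $i: duration-$d job finished first"; } &
        pids+=($!)
      done
      wait -n                        # returns once the first job in this subshell exits
      kill "${pids[@]}" 2>/dev/null  # cancel the slower two; the winner is already gone
    ) &
  done
  wait                               # block until every group has produced a winner
)
echo "$results"   # one line per group, each reporting the duration-1 job
```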
chepner
  • This is great, but I found out the RHEL 7.7 cluster I am running this on uses bash 4.2.46. Sometimes I am able to install packages in my home directory without admin privileges, but it seems to be difficult to get bash 4.3 for some reason. Is there a solution that doesn't rely on bash 4.3? – Attila the Fun Jul 17 '20 at 00:03
  • I was able to build a newer bash from source, and it seems to be working. – Attila the Fun Jul 17 '20 at 01:30
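For shells stuck without `wait -n` (e.g. the bash 4.2.46 mentioned above), a portable sketch can poll the group's PIDs with `kill -0` until one of them has exited, then kill the rest. Here `sleep` again stands in for `srun`; the helper name `any_finished` and the 0.5-second poll interval are arbitrary choices:

```shell
#!/usr/bin/env bash
# any_finished: hypothetical helper that succeeds once any of the given PIDs
# no longer exists (kill -0 probes for existence without sending a signal).
any_finished() {
  local pid
  for pid in "$@"; do
    kill -0 "$pid" 2>/dev/null || return 0
  done
  return 1
}

pids=()
for d in 5 3 1; do          # stand-ins for the 8/16/32-core runs
  sleep "$d" &
  pids+=($!)
done

until any_finished "${pids[@]}"; do
  sleep 0.5                 # poll interval; tune to taste
done
kill "${pids[@]}" 2>/dev/null   # cancel the stragglers; the finished one errors harmlessly
wait                            # reap everything before moving on
```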