I'm using SLURM sbatch to launch a bunch of parallel tasks on a cluster. The total number of cores I need to run all tasks in parallel exceeds the number of cores my sbatch script asks for, so some job steps won't run until others have finished.
Here's an example script that reflects my use case. Let's say each node in the cluster has 40 cores, and I use sbatch to allocate 10 nodes, so I have 400 cores at my disposal. But I have 12 tasks to run, each asking for 40 cores, so they need a total of 480 cores to run in parallel.
#!/bin/bash
#SBATCH --cpus-per-task=40
#SBATCH --nodes=10
# below are 12 invocations of srun in total
srun --cpus-per-task=40 --nodes=1 --ntasks=1 --job-name=first <executable> &
srun --cpus-per-task=40 --nodes=1 --ntasks=1 --job-name=second <executable> &
...
srun --cpus-per-task=40 --nodes=1 --ntasks=1 --job-name=twelfth <executable> &
wait
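For reference, the twelve srun invocations could equivalently be generated with a loop; the sketch below is just a condensed form of the same script, with the step names kept in a hypothetical names array and <executable> still a placeholder:
#!/bin/bash
#SBATCH --cpus-per-task=40
#SBATCH --nodes=10
# step names of the 12 tasks (the array is only here for brevity)
names=(first second third fourth fifth sixth seventh eighth ninth tenth eleventh twelfth)
for name in "${names[@]}"; do
    srun --cpus-per-task=40 --nodes=1 --ntasks=1 --job-name="$name" <executable> &
done
wait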
My problem is that sacct won't show the status of all twelve job steps until every invocation of srun has obtained the resources it needs. How can I adjust my way of using SLURM so that, immediately after I submit my batch script, I can inspect the state of all twelve job steps?
Here's my current way of operation: call sbatch <the script above>, then call sacct -j <JobID>.
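Concretely, the commands look roughly like this; <JobID> is the ID that sbatch prints, and the explicit --format list is just my way of spelling out the brief columns sacct shows by default:
sbatch <the script above>    # prints the <JobID> of the submitted job
sacct -j <JobID>             # default brief output, shown below
sacct -j <JobID> --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode    # same columns, spelled out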
At first, only ten job steps will show up in the output, all in the RUNNING state:
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
XXX              script      batch     (null)          0    RUNNING      0:0
XXX.0             first                (null)          0    RUNNING      0:0
XXX.1            second                (null)          0    RUNNING      0:0
XXX.2             third                (null)          0    RUNNING      0:0
XXX.3            fourth                (null)          0    RUNNING      0:0
XXX.4             fifth                (null)          0    RUNNING      0:0
XXX.5             sixth                (null)          0    RUNNING      0:0
XXX.6           seventh                (null)          0    RUNNING      0:0
XXX.7            eighth                (null)          0    RUNNING      0:0
XXX.8             ninth                (null)          0    RUNNING      0:0
XXX.9             tenth                (null)          0    RUNNING      0:0
...and the log file slurm-XXX.out would tell me: srun: Job XXX step creation temporarily disabled, retrying (Requested nodes are busy)
When one job step finally completes, a new line appears in the log file: srun: Step created for job XXX
and the output of sacct -j <JobID> will look like this (note that there are eleven job steps now):
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
XXX              script      batch     (null)          0    RUNNING      0:0
XXX.0             first                (null)          0    RUNNING      0:0
XXX.1            second                (null)          0    RUNNING      0:0
XXX.2             third                (null)          0    RUNNING      0:0
XXX.3            fourth                (null)          0    RUNNING      0:0
XXX.4             fifth                (null)          0    RUNNING      0:0
XXX.5             sixth                (null)          0    RUNNING      0:0
XXX.6           seventh                (null)          0    RUNNING      0:0
XXX.7            eighth                (null)          0  COMPLETED      0:0
XXX.8             ninth                (null)          0    RUNNING      0:0
XXX.9             tenth                (null)          0    RUNNING      0:0
XXX.10         eleventh                (null)          0    RUNNING      0:0
It's possible I'm missing some options, since the SLURM manual is really unwieldy. I've already read How to know the status of each process of one job in the slurm cluster manager?, but that doesn't solve my problem.
I'd appreciate suggestions on how to solve my problem, or on how to use SLURM in a "more correct" way.