
After using the Slurm cluster manager to `sbatch` a job with multiple processes, is there a way to know the status (running or finished) of each process? Can this be done from a Python script?

Christopher Bottoms
Yulong Ao
  • One way is to log in to the compute node and use the regular Linux tools (`top`, `htop`, `ps`). Surely Python can wrap something. – Tom de Geus Jun 04 '18 at 15:54

2 Answers


Just use the `sacct` command that comes with Slurm.

Given this script (`my.sbatch`):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2

# Start the first step in the background so both steps run concurrently
srun -n1 sleep 10 &
srun -n1 sleep 3

# Block until the background step has finished
wait

I run it:

sbatch my.sbatch

And then check on it with `sacct`:

sacct

Which gives me per-step info:

     JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
---------- ---------- ---------- ---------- ---------- ---------- --------
8021        my.sbatch    CLUSTER        me          2     RUNNING      0:0
8021.0          sleep                   me          1     RUNNING      0:0
8021.1          sleep                   me          1   COMPLETED      0:0
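
By default `sacct` lists your recent jobs; to restrict the output to a single job, pass its ID with `-j`:

sacct -j 8021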

`sacct` has many options to customize its output. For example,

sacct --format='JobID%6,State'

will give you just the job IDs (limited to six characters) and the current state of each job and step:

 JobID      State
------ ----------
  8021    RUNNING
8021.0    RUNNING
8021.1  COMPLETED
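
Since the question asks about Python, here is a minimal sketch that wraps `sacct` with `subprocess` and returns the state of every step of a job. The helper name `step_states` and the job ID 8021 are only illustrative:

import subprocess

def step_states(job_id):
    """Return a {step_id: state} dict for every step of a Slurm job."""
    # --parsable2 separates fields with '|' and drops trailing delimiters;
    # --noheader suppresses the header line, so every line is data
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=JobID,State",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|")[:2] for line in out.splitlines())

if __name__ == "__main__":
    for step, state in step_states(8021).items():
        print(step, state)

Run against the job above, this prints one line per job/step, e.g. `8021.1 COMPLETED`.
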
Christopher Bottoms

If the processes you mention are distinct job steps, then `sacct` can give you the information, as explained by @Christopher Bottoms.

But if the processes are different tasks within a single step, you can use a script that runs `ps` over parallel SSH on the compute nodes and offers a summarised view, as @Tom de Geus suggests; a sketch of that approach follows.
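
A minimal sketch of that approach, assuming passwordless SSH from the login node to the compute nodes (the helper names `job_nodes` and `ps_per_node` are hypothetical): it uses `squeue` to find the nodes allocated to the job, `scontrol show hostnames` to expand the compressed node list, and plain `ssh` to run `ps` on each node.

import subprocess

def job_nodes(job_id):
    """Expand the node list of a running job into individual hostnames."""
    # %N prints the compressed node list, e.g. node[01-03]
    nodelist = subprocess.run(
        ["squeue", "-j", str(job_id), "-h", "-o", "%N"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # scontrol expands node[01-03] into one hostname per line
    return subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.split()

def ps_per_node(job_id, user):
    """Run ps over SSH on every node of the job and print the output."""
    for host in job_nodes(job_id):
        out = subprocess.run(
            ["ssh", host, "ps", "-u", user, "-o", "pid,stat,etime,comm"],
            capture_output=True, text=True,
        ).stdout
        print(f"== {host} ==")
        print(out)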

damienfrancois