After using the Slurm cluster manager to sbatch a job with multiple processes, is there a way to know the status (running or finished) of each process? Can it be implemented in a Python script?
One way is to log in to the compute node and use the regular Linux tools (`top`, `htop`, `ps`). Surely Python can wrap something. – Tom de Geus Jun 04 '18 at 15:54
2 Answers
Just use the command `sacct` that comes with Slurm.

Given this script (`my.sh`):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
srun -n1 sleep 10 &   # step 0: runs in the background for 10 seconds
srun -n1 sleep 3      # step 1: finishes after 3 seconds
wait                  # do not exit until both steps have finished
I run it:

sbatch my.sh

And then check on it with `sacct`:

sacct

Which gives me per-step info:
     JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
---------- ---------- ---------- ---------- ---------- ---------- --------
8021            my.sh    CLUSTER         me          2    RUNNING      0:0
8021.0          sleep                    me          1    RUNNING      0:0
8021.1          sleep                    me          1  COMPLETED      0:0
`sacct` has a lot of options to customize its output. For example,

sacct --format='JobID%6,State'

will give you just the job IDs (truncated to 6 characters) and the current state of each job and step:
 JobID      State
------ ----------
8021      RUNNING
8021.0    RUNNING
8021.1  COMPLETED
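
To answer the Python part of the question: since `sacct` is an ordinary command-line tool, a Python script can simply run it and parse the output. Here is a minimal sketch, assuming `sacct` is on the `PATH` (the job ID `8021` and the helper name `step_states` are just for illustration):

import subprocess

def step_states(jobid):
    """Return a {JobID: State} dict for a job and all of its steps."""
    # --parsable2 gives pipe-delimited output, --noheader drops the
    # header line, and --format selects only the columns we need.
    out = subprocess.run(
        ["sacct", "--jobs", str(jobid),
         "--format=JobID,State", "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|") for line in out.splitlines())

if __name__ == "__main__":
    for job_id, state in step_states(8021).items():
        print(f"{job_id:12s} {state}")

Polling this function in a loop (with a reasonable sleep between calls) is enough to watch each step move from RUNNING to COMPLETED.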

Christopher Bottoms
If the processes you mention are distinct steps, then `sacct` can give you the information, as explained by @Christopher Bottoms.

But if the processes are different tasks within a single step, then you can use a script that runs `ps` over parallel SSH on the compute nodes and offers a summarised view, as @Tom de Geus suggests. A rough sketch of that approach follows.
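
This sketch assumes passwordless SSH to the compute nodes; the job ID `8021`, the user name `me`, and the helper names are hypothetical. `squeue --format=%N` returns the job's compressed node list, and `scontrol show hostnames` expands it into individual hostnames. A parallel-SSH script would contact all nodes at once; this sketch loops over them sequentially for brevity:

import subprocess

def job_nodes(jobid):
    """Expand the job's node list (e.g. 'node[01-03]') into hostnames."""
    nodelist = subprocess.run(
        ["squeue", "--jobs", str(jobid), "--noheader", "--format=%N"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.split()

def ps_on_nodes(jobid, user):
    """Run ps on every node allocated to the job and print the result."""
    for node in job_nodes(jobid):
        # Sequential for simplicity; parallel SSH would query all nodes at once.
        out = subprocess.run(
            ["ssh", node, "ps", "-u", user, "-o", "pid,stat,etime,comm"],
            capture_output=True, text=True,
        ).stdout
        print(f"--- {node} ---\n{out}")

if __name__ == "__main__":
    ps_on_nodes(8021, "me")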

damienfrancois