
I submitted a job to a SLURM queue, and the job ran and completed. When I check the completed jobs using the sacct command, I see additional entries that I did not expect:

       JobID                        JobName      State      NCPUS  Timelimit
5297048                                test  COMPLETED          1   00:10:00  
5297048.bat+                          batch  COMPLETED          1           
5297048.ext+                         extern  COMPLETED          1       

Can anyone explain what the 'batch' and 'extern' jobs are and what their purpose is? Why does the extern job always complete, even when the primary job fails?

I have attempted to search the documentation but have not found a satisfactory and complete answer.

EDIT: Here's the script I am submitting to produce the above sacct output:

#!/bin/bash
echo test_script > done.txt

With the following sbatch command:

sbatch -A BRIDGE-CORE-SL2-CPU --nodes=1 --ntasks=1 -p skylake --cpus-per-task 1 -J jobname -t 00:10:00 --output=./output.out --error=./error.err < test.sh
Parsa

2 Answers


A Slurm job contains multiple job steps, and Slurm accounts for the resource usage of each step separately. Usually, these steps are created using srun/mpirun and are numbered starting from 0. In addition to those, there are sometimes two special steps. For example, take the following job:

sbatch -n 4 --wrap="srun hostname; srun echo Hello World"

This resulted in the following sacct output:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
5163571            wrap     medium      admin          4  COMPLETED      0:0 
5163571.bat+      batch                 admin          4  COMPLETED      0:0 
5163571.ext+     extern                 admin          4  COMPLETED      0:0 
5163571.0      hostname                 admin          4  COMPLETED      0:0 
5163571.1          echo                 admin          4  COMPLETED      0:0 

The two srun calls created the steps 5163571.0 and 5163571.1. The 5163571.bat+ step accounts for the resources needed by the batch script itself (which in this case is just srun hostname; srun echo Hello World; --wrap simply puts that into a file and adds #!/bin/sh).

Many non-MPI programs do most of their calculations in the batch step, so their resource usage is accounted for there.

And now for 5163571.ext+: this step accounts for all resource usage by that job outside of Slurm. It only shows up if PrologFlags=Contain is set in slurm.conf.
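To check whether a cluster is configured this way, you can look at the PrologFlags line in the output of scontrol show config. The sketch below assumes that output has the usual "Key = Value" layout; has_contain_flag is a hypothetical helper name, not something Slurm provides.

```shell
#!/bin/bash
# has_contain_flag (hypothetical helper): reads `scontrol show config` output
# on stdin and exits 0 if the PrologFlags line lists Contain, 1 otherwise.
# Assumes the "Key = Value" layout that scontrol normally prints.
has_contain_flag() {
    awk -F'=' '
        tolower($1) ~ /^prologflags[ \t]*$/ && tolower($2) ~ /contain/ { found = 1 }
        END { exit !found }
    '
}

# Usage on a cluster with Slurm installed:
#   scontrol show config | has_contain_flag && echo "extern steps are enabled"
```

If the check fails, you would not expect to see a .extern line in sacct on that cluster.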

An example of a process that belongs to a Slurm job but is not directly controlled by Slurm is an ssh session. If you ssh into a node where one of your jobs is running, your session is placed into the context of that job (and you are limited to the job's resources by cgroups, if that is set up). Everything you compute in that ssh session is accounted for in the .extern job step.
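To summarize the different kinds of lines, here is a small sketch that labels each row of sacct output by its JobID suffix. It assumes you ask sacct for pipe-separated output with --parsable2 --noheader --format=JobID,JobName,State, so that the step names appear in full as .batch and .extern rather than the truncated .bat+ and .ext+. classify_steps is a hypothetical helper name.

```shell
#!/bin/bash
# classify_steps (hypothetical helper): reads pipe-separated sacct rows
# (JobID|JobName|State) on stdin and prints what each row represents.
classify_steps() {
    awk -F'|' '
        $1 ~ /\.batch$/  { print $1 ": batch script step (" $3 ")"; next }
        $1 ~ /\.extern$/ { print $1 ": extern step (" $3 ")"; next }
        $1 ~ /\.[0-9]+$/ { print $1 ": srun step running " $2 " (" $3 ")"; next }
                         { print $1 ": job allocation (" $3 ")" }
    '
}

# Usage on a cluster (job ID taken from the example above):
#   sacct -j 5163571 --parsable2 --noheader --format=JobID,JobName,State | classify_steps
```

Because the helper only reads stdin, you can also try it on saved sacct output without access to the cluster.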

Marcus Boden

A job is composed of several job steps, and each job step is shown independently. In your output you have $JOBID, which stands for the overall reservation, and $JOBID.batch, which represents the main script you submitted.

Regarding external... I'm not sure, but I guess that you started one job step and you named it as "external". In that case, that is the information of that job step.

If you show us the script you submitted, we can clarify some doubts.

Poshi
  • If $JOBID.batch represents the script I submitted, and external is something yet to be resolved, what does the job with simply $JOBID represent? – Parsa Sep 23 '18 at 16:52
  • The $JOBID represents the overall reservation. – Poshi Sep 24 '18 at 11:07
  • If you run `sacct` with `-l` parameter you will see that the information given by each of the lines is different. Maybe you have a reservation of 8 CPUs and two job steps, each of them using 4 CPUs. That will be shown in `sacct`. – Poshi Sep 24 '18 at 11:08
  • Unfortunately, I can see neither a job step being started in your job nor any reference to "external". I have no idea about that. – Poshi Sep 24 '18 at 11:10
  • I think the `.ext` job is common to SLURM, for example you can see in this unrelated question the `sacct` output shows a `.ext` job step. https://bugs.schedmd.com/show_bug.cgi?id=3461#c2 – Parsa Sep 24 '18 at 13:05