
I am running a large Snakemake (v5.3.0) pipeline on a Slurm scheduler (v14.11.4). Unfortunately, roughly 1 in 1000 jobs crashes with a NODE_FAIL state (ExitCode 0), which Snakemake does not recognise, leaving half-finished output files behind.

To make Snakemake aware of these failures, I figured that --cluster-status with a script that looks up the job state for a given job id via sacct should do the trick. I modified a script I found online, which now looks like this:

#!/usr/bin/env python3
import subprocess
import sys

jobid = sys.argv[1]

# Ask the Slurm accounting database for the state of this job.
state = subprocess.run(
    ['sacct', '-j', jobid, '--format=State'],
    stdout=subprocess.PIPE,
).stdout.decode('utf-8')

# The first two lines of the output are the header and the dashed
# separator; the third line holds the state of the job itself.
state = state.split('\n')[2].strip()

# Map Slurm job states onto the three states Snakemake understands:
# "running", "success" and "failed".
map_state = {"PENDING": 'running',
             "RUNNING": 'running',
             "SUSPENDED": 'running',
             "CANCELLED": 'failed',
             "COMPLETING": 'running',
             "COMPLETED": 'success',
             "CONFIGURING": 'running',
             "FAILED": 'failed',
             "TIMEOUT": 'failed',
             "PREEMPTED": 'failed',
             "NODE_FAIL": 'failed',
             "REVOKED": 'failed',
             "SPECIAL_EXIT": 'failed',
             "": 'success'}

print(map_state[state])
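
For reference, a quick manual test with the id of a finished job (the id below is just a placeholder) prints the mapped state:

~/scripts/snakemake/slurm_status.py 1234567
success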

The script works fine on the command line. However, when I start Snakemake as follows:

SM_ARGS="--cpus-per-task {cluster.cpus-per-task} --mem-per-cpu {cluster.mem-per-cpu-mb} --job-name {cluster.job-name} --ntasks {cluster.ntasks} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}"

snakemake -p \
    $* \
    --latency-wait 120 \
    -j 600 \
    --cluster-config $(dirname $0)/cluster.slurm.json \
    --cluster "sbatch $SM_ARGS" \
    --cluster-status ~/scripts/snakemake/slurm_status.py
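
For context, the cluster.slurm.json referenced above only needs to supply the {cluster.*} keys used in SM_ARGS; a minimal sketch with placeholder values (not my real settings) looks like this:

{
    "__default__": {
        "cpus-per-task": 1,
        "mem-per-cpu-mb": 4000,
        "job-name": "{rule}.{wildcards}",
        "ntasks": 1,
        "partition": "normal",
        "time": "12:00:00",
        "mail-user": "user@example.org",
        "mail-type": "FAIL",
        "error": "logs/{rule}.{wildcards}.err",
        "output": "logs/{rule}.{wildcards}.out"
    }
}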

Snakemake submits the first batch of 600 jobs and basically stalls right afterwards; no additional jobs are submitted, even though all initially submitted jobs finish successfully. After all jobs have been submitted, the Snakemake log shows a single error:

sacct: error: slurmdbd: Getting response to message type 1444
sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: No error

I assume my command does not pass the job id to slurm_status.py correctly. However, I do not know how Snakemake passes the job id to slurm_status.py, and neither Google nor the sparse information from snakemake --help could answer that question.

Thanks for your support.

Feliks

1 Answer


I have never used Snakemake, but I have a guess. From the Snakemake documentation:

For this it is necessary that the submit command provided to --cluster returns the cluster job id.

But your --cluster command does not return just a job id; it returns a string with the job id at the end. You can try adding the --parsable parameter to the sbatch invocation. According to the sbatch manual:

Outputs only the job id number and the cluster name if present. The values are separated by a semicolon. Errors will still be displayed.
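
Applied to your invocation, that would mean changing only the --cluster option and leaving the rest of the command untouched:

--cluster "sbatch --parsable $SM_ARGS"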

If that does not work, you will have to find another way to get a clean job id out of sbatch. For example, you can wrap the sbatch command in another script that parses its output:

#!/bin/bash

# sbatch prints "Submitted batch job <id>"; keep only the id.
sbatch "$@" | awk '{print $4}'
Poshi
  • Thanks @Poshi. I wasn't sure what exactly gets passed to the Python script. However, I added `--parsable` to the sbatch command, changed the argument used in the Python script to `sys.argv[-1]`, and now it finally works. – Feliks Dec 25 '18 at 06:04
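
For reference, the fix described in the comment boils down to two small changes (a sketch of what the comment says, not the exact files):

--cluster "sbatch --parsable $SM_ARGS"

and in slurm_status.py:

jobid = sys.argv[-1]  # take the last command-line argument instead of the first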