I am running a large snakemake (v5.3.0) pipeline using a slurm scheduler (v14.11.4). Unfortunately, roughly 1 in 1000 jobs crashes with a NODE_FAIL (ExitCode 0), which snakemake does not recognise, leaving half-finished output files.
To make snakemake aware of such incidents, I figured that --cluster-status plus a script that looks up the job state for a given jobid via sacct should do the trick. I modified a script I found online, which now looks like this:
#!/usr/bin/env python3
import subprocess
import sys

# the cluster job id is expected as the first command-line argument
jobid = sys.argv[1]

# ask slurm's accounting database for the state of this job
state = subprocess.run(['sacct', '-j', jobid, '--format=State'],
                       stdout=subprocess.PIPE).stdout.decode('utf-8')
# sacct prints a header line, a separator line, and then one line per job
# step; the third line is the state of the job allocation itself
state = state.split('\n')[2].strip()

# translate slurm job states into the three words snakemake understands:
# "running", "success" or "failed"
map_state = {"PENDING": 'running',
             "RUNNING": 'running',
             "SUSPENDED": 'running',
             "CANCELLED": 'failed',
             "COMPLETING": 'running',
             "COMPLETED": 'success',
             "CONFIGURING": 'running',
             "FAILED": 'failed',
             "TIMEOUT": 'failed',
             "PREEMPTED": 'failed',
             "NODE_FAIL": 'failed',
             "REVOKED": 'failed',
             "SPECIAL_EXIT": 'failed',
             "": 'success'}

print(map_state[state])
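I call it with a job id as its single argument, e.g. (the id here is just a placeholder):
~/scripts/snakemake/slurm_status.py 12345678
# prints one of: running / success / failed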
The script works fine on the command line. However, when starting snakemake as follows:
SM_ARGS="--cpus-per-task {cluster.cpus-per-task} --mem-per-cpu {cluster.mem-per-cpu-mb} --job-name {cluster.job-name} --ntasks {cluster.ntasks} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}"
snakemake -p \
$* \
--latency-wait 120 \
-j 600 \
--cluster-config $(dirname $0)/cluster.slurm.json \
--cluster "sbatch $SM_ARGS" \
--cluster-status ~/scripts/snakemake/slurm_status.py
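For completeness, cluster.slurm.json supplies the values referenced in SM_ARGS through a __default__ entry, roughly along these lines (the values shown are placeholders, not my real settings):
{
    "__default__": {
        "cpus-per-task": 1,
        "mem-per-cpu-mb": 4000,
        "job-name": "{rule}",
        "ntasks": 1,
        "partition": "normal",
        "time": "12:00:00",
        "mail-user": "user@example.com",
        "mail-type": "FAIL",
        "error": "logs/{rule}.err",
        "output": "logs/{rule}.out"
    }
}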
It submits the first batch of 600 jobs and then basically stalls, with no additional jobs being submitted, even though all initially submitted jobs finish successfully. Once all jobs have been submitted, the snakemake log shows a single error:
sacct: error: slurmdbd: Getting response to message type 1444
sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: No error
I assume my command does not pass the jobid to slurm_status.py correctly. However, I do not know how snakemake passes the jobid to slurm_status.py, and Google could not answer this question (nor could the sparse information in snakemake --help).
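To narrow this down, my plan is to log whatever arguments the script actually receives, e.g. by adding something like the following right after the imports in slurm_status.py (the log path is arbitrary):
# append the raw arguments to a file so I can see what snakemake passes in
with open('/tmp/slurm_status_debug.log', 'a') as fh:
    print(sys.argv, file=fh)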
Thanks for your support.