I have a workflow rule with varying time requirements for a slurm cluster. I decorated the workflow as per the documentation to increase the slurm time limit for resubmitted jobs (my cluster config.yaml has the arg restart-times: 3); look for resources.time:
example_1:
rule run_toolx:
    ...
    resources:
        cpus=config["smk_params"]["threads"] * config["smk_params"]["thread_factor"],
        mem_mb=4000,
        time=lambda wildcards, attempt: ( 3 * ( attempt ** 2 ))
    threads:
        config["smk_params"]["threads"] * config["smk_params"]["thread_factor"]
    log:
        "logs/{sample}_toolX.log"
    shell:
        """
        echo "Queueing toolX job with a timelimit of {resources.time} minutes."
        toolX --threads {threads} | tee {log} 2> {log}
        """
When I review my snakemake log file for the echo, I find three job submissions with the line: "Queueing toolX job with a timelimit of 3 minutes.". I expected the time limit to increase to 3, 12, and 27 minutes over the attempts. This is backed by the slurm logs, where each submitted job requests only "00:03:00" of computing time on the nodes. When the job hits this time limit, the slurm manager interrupts it as expected:
output_1:
JobID           JobName  Partition  AllocCPUS      State    Elapsed  Timelimit ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- --------
4983483      snakejob.+     normal          6    TIMEOUT   00:03:15   00:03:00      0:0
4983483.bat+      batch                     6  CANCELLED   00:03:16                0:15
4983904      snakejob.+     normal          6    TIMEOUT   00:03:03   00:03:00      0:0
4983904.bat+      batch                     6  CANCELLED   00:03:04                0:15
4984740      snakejob.+     normal          6    TIMEOUT   00:03:03   00:03:00      0:0
4984740.bat+      batch                     6  CANCELLED   00:03:04                0:15
Apparently, either my lambda function does not work or the variable attempt is not a counter.
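To rule out the lambda itself, here is a minimal plain-Python sketch of what I expect it to return per attempt, evaluated outside Snakemake. The helper minutes_to_slurm is hypothetical (it is not part of Snakemake or the slurm profile) and assumes resources.time is read as whole minutes, as in my echo:

```python
# Same expression as resources.time in example_1, evaluated outside Snakemake.
time_limit = lambda wildcards, attempt: 3 * (attempt ** 2)


def minutes_to_slurm(minutes):
    """Hypothetical helper: format whole minutes as HH:MM:SS for comparison
    with the Timelimit column of the slurm accounting output."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


# wildcards is unused by the lambda, so None is enough for this check.
limits = [time_limit(None, attempt) for attempt in (1, 2, 3)]

for attempt, minutes in zip((1, 2, 3), limits):
    print(f"attempt {attempt}: {minutes} min -> {minutes_to_slurm(minutes)}")
```

If attempt really counts 1, 2, 3, the expression itself yields 3, 12 and 27 minutes, so three submissions with 00:03:00 would suggest that attempt stays at 1 for this rule.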
Thanks for sharing a thought.
Edit_1:
- With config["smk_params"]["threads"]=2 and config["smk_params"]["thread_factor"]=3, the math for resources.cpus works.
- This tutorial uses the same notation as in my example above.
- The attempt parameter was introduced with [4.1.0] - 2017-09-26. I am using snakemake 6.15.1.
- I am using the slurm job submission scripts from Snakemake-Profiles.
- Some tutorials and posts use 'int(attempt)'; this does not solve the above problem.
Edit_2: Reproducible example for successful attempt usage
reprex_1:
shell.executable("bash")

localrules: all

rule all:
    input:
        "success.txt"

rule run_toolX:
    output:
        "success.txt",
    resources:
        cpus=1,
        mem_mb=100,
        time=lambda wildcards, input, threads, attempt: ( 3 * ( attempt ** 2 ))
    shell:
        """
        echo "Queueing run_quast with a timelimit of {resources.time} minutes."
        sleep 4m
        echo "Queueing run_quast with a timelimit of {resources.time} minutes." > {output}
        """
yielding:
output_2:
4985470      snakejob.+     normal          1    TIMEOUT   00:03:16   00:03:00      0:0
4985470.bat+      batch                     1  CANCELLED   00:03:17                0:15
4985477      snakejob.+     normal          1  COMPLETED   00:04:08   00:12:00      0:0
4985477.bat+      batch                     1  COMPLETED   00:04:08                 0:0
Experience: I first tried it with a one-minute setup, which worked, to my surprise, and then tested the exact same lambda function as in my complex rule, where it results in the error. The same slurm profile was used for both examples.
It's quirky.
Edit_3: testing all variations of the notation
My reprex works with or without, respectively:
- int(attempt)
- int( lambda ... )
- lambda wildcards, input, threads, attempt: ...
- lambda wildcards, attempt: ...
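A rough plain-Python sketch of why both lambda signatures can give the same result. It assumes that the caller passes wildcards positionally and then only those keyword arguments the callable names; call_resource is a hypothetical stand-in written for this post, not Snakemake's actual code:

```python
import inspect


def call_resource(func, wildcards, **available):
    """Hypothetical stand-in for the caller: pass wildcards positionally,
    plus only the keyword arguments that func declares after it."""
    names = list(inspect.signature(func).parameters)[1:]
    kwargs = {name: available[name] for name in names}
    return func(wildcards, **kwargs)


# The notations from the list above (int() applied to the computed value).
short_form = lambda wildcards, attempt: 3 * (attempt ** 2)
long_form = lambda wildcards, input, threads, attempt: 3 * (attempt ** 2)
int_form = lambda wildcards, attempt: int(3 * (attempt ** 2))

results = [
    [call_resource(f, None, input=[], threads=1, attempt=a) for a in (1, 2, 3)]
    for f in (short_form, long_form, int_form)
]
print(results)
```

Under that assumption all three variants return the same minutes for attempts 1 to 3, which matches my observation that the notation does not matter in the reprex.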
Edit_4: embedding the reprex in the complex workflow
I am moving reprex_1 step by step towards the setup/environment of the non-functional rule:
- reprex_1 works within the complex workflow when the all rule only pulls the reprex_1 rule output
- with the all rule only pulling the example_1 rule output, and with the resources of example_1, the counter is broken
Next culprit to test: example_1 is a grouped rule. Maybe rule grouping disables the attempt counter.