Snakemake on SLURM: retry attempt counter does not change

Question

I have a workflow rule whose run time varies, and it runs on a slurm cluster. Following the documentation, I set up the rule so that the slurm time limit increases for resubmitted jobs (my cluster config.yaml sets restart-times: 3); see resources.time:

example_1:

rule run_toolx:
    ...
    resources:
        cpus=config["smk_params"]["threads"] * config["smk_params"]["thread_factor"],
        mem_mb=4000,
        time=lambda wildcards, attempt: ( 3 * ( attempt ** 2 ))
    threads:
        config["smk_params"]["threads"] * config["smk_params"]["thread_factor"]
    log:
        "logs/{sample}_toolX.log"
    shell:
        """
        echo "Queueing toolX job with a timelimit of {resources.time} minutes."
        toolX --threads {threads} 2>&1 | tee {log}
        """

When I check my snakemake log file for the echo output, I find three job submissions, each with the line: "Queueing toolX job with a timelimit of 3 minutes.". I expected the time limit to increase across attempts: 3, 12, 27 minutes. The slurm logs confirm the problem: each submitted job requests only "00:03:00" of computing time on the nodes. When a job hits this time limit, the slurm manager interrupts it as expected:

output_1:

       JobID    JobName  Partition  AllocCPUS      State    Elapsed  Timelimit ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- 
4983483      snakejob.+     normal          6    TIMEOUT   00:03:15   00:03:00      0:0 
4983483.bat+      batch                     6  CANCELLED   00:03:16                0:15 
4983904      snakejob.+     normal          6    TIMEOUT   00:03:03   00:03:00      0:0 
4983904.bat+      batch                     6  CANCELLED   00:03:04                0:15 
4984740      snakejob.+     normal          6    TIMEOUT   00:03:03   00:03:00      0:0 
4984740.bat+      batch                     6  CANCELLED   00:03:04                0:15

Apparently, either my lambda function does not work or the variable attempt is not a counter.
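For reference, here is a minimal Python sketch of the values the lambda should produce, evaluated outside Snakemake. It assumes attempt starts at 1 and increases by one per resubmission, as the documentation describes. The helper name slurm_time is hypothetical and only formats minutes the way the Timelimit column shows them.

```python
# The same callable as in resources.time, evaluated by hand.
time_limit = lambda wildcards, attempt: 3 * (attempt ** 2)


def slurm_time(minutes):
    """Format whole minutes as HH:MM:SS (hypothetical helper, not part of Snakemake)."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


# Assumption: attempt is 1 for the first submission, then 2, then 3.
schedule = [time_limit(None, attempt) for attempt in range(1, 4)]
formatted = [slurm_time(m) for m in schedule]

print(schedule)
print(formatted)
```

If the counter worked, the Timelimit column above would show these three values in turn instead of 00:03:00 three times.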

Thanks for sharing a thought.

Edit_1:

config["smk_params"]["threads"]=2 and config["smk_params"]["thread_factor"]=3, so the math for resources.cpus works.
This tutorial uses the same notation as in my example above.
The attempt parameter was introduced with [4.1.0] - 2017-09-26. I am using snakemake 6.15.1.
I am using the slurm job submission scripts from Snakemake-Profiles.
Some tutorials and posts use 'int(attempt)'; this does not solve the problem above.

Edit_2: Reproducible example for successful `attempt` usage

reprex_1:

shell.executable("bash")

localrules: all

rule all:
    input:
        "success.txt"

rule run_toolX:
    output:
        "success.txt",
    resources:
        cpus=1,
        mem_mb=100,
        time=lambda wildcards, input, threads, attempt: ( 3 * ( attempt ** 2 ))
    shell:
        """
        echo "Queueing run_quast with a timelimit of {resources.time} minutes."
        sleep 4m
        echo "Queueing run_quast with a timelimit of {resources.time} minutes." > {output}
        """

yielding:

output_2:

4985470      snakejob.+     normal          1    TIMEOUT   00:03:16   00:03:00      0:0 
4985470.bat+      batch                     1  CANCELLED   00:03:17                0:15 
4985477      snakejob.+     normal          1  COMPLETED   00:04:08   00:12:00      0:0 
4985477.bat+      batch                     1  COMPLETED   00:04:08                 0:0

Experience: I first tried a one-minute setup, which to my surprise worked, and then tested the exact same lambda function as in the complex rule that produces the error. The same slurm profile was used for both examples.

It's quirky.

Edit_3: testing all variations of the notation

My reprex works with any of these variations:

int(attempt)
int( lambda ... )
lambda wildcards, input, threads, attempt: ...
lambda wildcards, attempt: ...

Edit_4: embedding the reprex in the complex workflow

I am moving reprex_1 step by step toward the setup/environment of the non-functional rule:

reprex_1 works inside the complex workflow when rule all requests only the reprex_1 output.
When rule all requests only the example_1 output, with the resources of example_1, the counter is broken.

Next culprit to test: example_1 is a grouped rule. Maybe rule grouping disables the attempt counter.

Would you be able to post a self-contained example and command-line execution that we can copy and paste to easily reproduce the issue? E.g. one without config and wildcards? Also, if you want to add information to your question, edit the question directly instead of posting comments. – dariober, Feb 07 '22 at 16:49
I will write something up, and I have moved my comments into the edit section of my post. – Brendy, Feb 07 '22 at 16:56

score 1 · Answer 1 · answered Feb 07 '22 at 21:01

It is a bug and I have raised an issue:

Retry counter attempt is dysfunctional for grouped rules

That was painful.

Brendy

Good detective work! I'm curious, will it work if the non-functional example is the first rule in a group? – Troy Comi Feb 10 '22 at 01:18

score 0 · Answer 2 · answered Feb 07 '22 at 16:59

One possible problem is that the submitted script doesn't use the time option appropriately. Can you check your profile script's cluster option?

This is an incomplete string, but the key idea is to show that the time option should reference the appropriate option in resources:

cluster: "sbatch --parsable -t {resources.time}"

As a debugging experiment, you could also try starting with a specific attempt number:

snakemake -s Snakefile -j 1 --attempt 3 -n

If this doesn't modify the submission characteristics, then there is a problem with snakemake handling of attempt (as opposed to a problem during job submission to SLURM).
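To illustrate why the placeholder matters, here is a rough Python sketch. It uses plain str.format as a stand-in for the substitution and is not Snakemake's actual implementation. The render function and the hard-coded comparison string are hypothetical.

```python
from types import SimpleNamespace

# Template from above; only the {resources.time} placeholder is of interest.
cluster_template = "sbatch --parsable -t {resources.time}"


def render(template, attempt):
    """Fill the template for one attempt (hypothetical stand-in for the real substitution)."""
    resources = SimpleNamespace(time=3 * attempt ** 2)
    return template.format(resources=resources)


# With the placeholder, the requested time follows the attempt number.
commands = [render(cluster_template, a) for a in (1, 2, 3)]

# With a hard-coded value, every resubmission asks for the same limit.
hardcoded = [render("sbatch --parsable -t 3", a) for a in (1, 2, 3)]

for line in commands + hardcoded:
    print(line)
```

If your profile builds the sbatch call without referencing resources.time, you would see the second pattern regardless of what the rule defines.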

An interesting approach, but my complex rule and the reprex use the same cluster config and job submission scripts from their [slurm profile](https://github.com/Snakemake-Profiles/slurm). For the latter, the `sacct` output shows the correct increase of the counter (see column _Timelimit_ in the code boxes above), so I do not think the error is in the cluster config. – Brendy, Feb 07 '22 at 18:41
