I am relatively new to snakemake, and I am having some trouble adapting a scatter-gather DeepVariant workflow into snakemake rules.

In the original Snakefile, I would like to scatter the first step across a cluster. DeepVariant tracks the shard number in its intermediate files with a *.00001-of-00256.* naming format, so I need string formatting to supply both the shard number and the total number of shards in the input, output, and shell fields. I provide the shard number as a wildcard in the params of the scatter rule. The expand() function in the input field of the gather rule generates the expected filenames, but Snakemake cannot find the input file paths that the scatter step would generate.
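To make the naming scheme concrete, here is a small plain-Python sketch (run outside Snakemake) of how the doubled braces in an f-string interact with a later str.format call. The variable names are illustrative, and treating expand() as filling its placeholders in roughly this str.format-like way is an assumption rather than something checked against the Snakemake source.

```python
N_SHARDS = 8

# The f-string fills in N_SHARDS immediately; the doubled braces survive
# as a literal "{shard:05}" placeholder for a later formatting step.
pattern = f"test_{{shard:05}}-of-{N_SHARDS:05}.txt"

# A later str.format call applies the ":05" format spec to each shard number.
names = [pattern.format(shard=i) for i in range(N_SHARDS)]

print(pattern)
print(names[0], names[-1])
```

The point of the sketch is only that the ":05" part is a Python format spec, which is meaningful to str.format but is not necessarily understood as part of a wildcard in an output pattern.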

I have generated a minimal reproducible example below, as well as the output of running this example (lightly redacted to remove some path information).

N_SHARDS = 8

rule all:
    input: "done.txt"


rule scatter:
    input: "start.txt"
    output: f"test_{{shard:05}}-of-{N_SHARDS:05}.txt"
    params:
        shard = range(N_SHARDS)
    message: "scattering"
    shell:
        f"echo {{wildcards.shard}} {N_SHARDS} > {{output}}"


rule gather:
    input: expand(f"test_{{shard:05}}-of-{N_SHARDS:05}.txt", shard=range(N_SHARDS))
    output: touch("done.txt")
    shell: "echo gathering"

$ touch start.txt
$ snakemake -s example.smk -j 1
Building DAG of jobs...
MissingInputException in line 17 of /redacted/example.smk:
Missing input files for rule gather:
test_00002-of-00008.txt
test_00000-of-00008.txt
test_00006-of-00008.txt
test_00001-of-00008.txt
test_00004-of-00008.txt
test_00005-of-00008.txt
test_00007-of-00008.txt
test_00003-of-00008.txt

I have built very similar rules for other scatter-gather concepts that do not require string formatting of wild cards, so that is the only thing I can think of that is different in this case. I would appreciate any insights!

UPDATE: A helpful twitter user noted that I can remove the :05 in scatter->output and the rule works. This is great, and it happens to solve my original problem, but only because DeepVariant is tolerant of zero-padding for the shard parameter passed at the command line. Is there a solution that allows me to apply formatting to a wildcard?

1 Answer

This is how I would do it:

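# Explicit import so the Snakefile does not rely on `re` already being in scope
import re

N_SHARDS = '00008'

# Zero-padded shard labels: '00000', '00001', ..., '00007'
shard = ['%05d' % x for x in range(int(N_SHARDS))]

# Restrict the wildcard to exactly the labels listed above
wildcard_constraints:
    shard= '|'.join([re.escape(x) for x in shard])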

rule all:
    input: 
        "done.txt",

rule scatter:
    input: 
        "start.txt",
    output:
        "test_{shard}-of-%s.txt" % N_SHARDS,
    shell:
        r"""
        echo {wildcards.shard} %s > {output}
        """ % N_SHARDS
    
rule gather:
    input:
        expand('test_{shard}-of-%s.txt' % N_SHARDS, shard= shard),
    output: 
        touch("done.txt")
    shell: 
        "echo gathering"

The wildcard_constraints bit may be redundant, but I tend to use it quite liberally when I know exactly what values the wildcards are going to take.
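As a rough illustration of what that constraint accepts, the following plain-Python sketch builds the same regular expression and tests a few candidate values against it. The helper name `is_valid_shard` is hypothetical, and `re.fullmatch` is only a stand-in for how Snakemake applies the constraint when matching filenames.

```python
import re

N_SHARDS = "00008"
shard = ["%05d" % x for x in range(int(N_SHARDS))]

# Same expression as in the wildcard_constraints block:
# an alternation of the escaped, zero-padded labels.
constraint = "|".join(re.escape(x) for x in shard)


def is_valid_shard(value):
    """Return True if value is one of the expected zero-padded labels."""
    return re.fullmatch(constraint, value) is not None


for candidate in ["00003", "3", "00008"]:
    print(candidate, is_valid_shard(candidate))
```

Only the eight padded labels pass, so an unpadded "3" or an out-of-range "00008" would not be treated as a shard.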

One thing: you seem to know beforehand how many shards DeepVariant is going to generate (N_SHARDS = 8 in the example). Is this actually the case? If not, I think you need to resort to the checkpoint functionality of snakemake.
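If the shard count were not known in advance, the gather inputs would have to be worked out from the files that actually appear. The sketch below shows only that plain-Python part, under the assumption that the files follow the test_XXXXX-of-XXXXX.txt scheme from the example; `discover_shards` and `SHARD_RE` are hypothetical names, and in a Snakefile a function like this would presumably be called from an input function once the checkpoint has finished.

```python
import re
from pathlib import Path

# Hypothetical naming scheme, modelled on the example above.
SHARD_RE = re.compile(r"test_(\d{5})-of-(\d{5})\.txt")


def discover_shards(directory):
    """Return the sorted shard labels of matching files found in directory."""
    labels = []
    for path in Path(directory).iterdir():
        match = SHARD_RE.fullmatch(path.name)
        if match:
            # group(1) is the shard label, group(2) the total shard count
            labels.append(match.group(1))
    return sorted(labels)
```

Files that do not match the pattern are ignored, so unrelated outputs in the same directory would not end up as gather inputs.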

C. Braun
dariober