I am relatively new to snakemake, and I am having some trouble adapting a scatter-gather DeepVariant workflow into snakemake rules.
In the original Snakefile, I would like to scatter the first step across a cluster. DeepVariant uses a *.00001-of-00256.* format to track the shard number in its intermediate files, so I need string formatting to supply both the shard number and the total number of shards within the input, output, and shell fields, and I provide the shard number as a wildcard in the params of the scatter rule. The expand() function in the input field of the gather rule correctly generates the expected filenames, but Snakemake cannot find the input file paths that the scatter step would generate.
I have generated a minimal reproducible example below, as well as the output of running this example (lightly redacted to remove some path information).
N_SHARDS = 8

rule all:
    input: "done.txt"

rule scatter:
    input: "start.txt"
    output: f"test_{{shard:05}}-of-{N_SHARDS:05}.txt"
    params:
        shard = range(N_SHARDS)
    message: "scattering"
    shell:
        f"echo {{wildcards.shard}} {N_SHARDS} > {{output}}"

rule gather:
    input: expand(f"test_{{shard:05}}-of-{N_SHARDS:05}.txt", shard=range(N_SHARDS))
    output: touch("done.txt")
    shell: "echo gathering"
$ touch start.txt
$ snakemake -s example.smk -j 1
Building DAG of jobs...
MissingInputException in line 17 of /redacted/example.smk:
Missing input files for rule gather:
test_00002-of-00008.txt
test_00000-of-00008.txt
test_00006-of-00008.txt
test_00001-of-00008.txt
test_00004-of-00008.txt
test_00005-of-00008.txt
test_00007-of-00008.txt
test_00003-of-00008.txt
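As a sanity check on the pattern used in the rules above, here is a minimal sketch of what the two formatting stages do in plain Python, outside Snakemake. The names template and names are only illustrative. This shows how str.format() treats the placeholder; it does not show how Snakemake parses the same text in an output field, which may well differ.

```python
N_SHARDS = 8  # same value as in the example

# Stage 1: the f-string fills in N_SHARDS immediately. The doubled braces
# collapse to single braces, leaving a {shard:05} placeholder behind.
template = f"test_{{shard:05}}-of-{N_SHARDS:05}.txt"
print(template)  # test_{shard:05}-of-00008.txt

# Stage 2: str.format() reads ":05" as a format spec and zero-pads the integer.
names = [template.format(shard=i) for i in range(N_SHARDS)]
print(names[0], names[-1])  # test_00000-of-00008.txt test_00007-of-00008.txt
```

The generated names match the files listed in the error message, so the filenames themselves are being built as intended.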
I have built very similar rules for other scatter-gather workflows that do not require string formatting of wildcards, so that is the only difference I can think of in this case. I would appreciate any insights!
UPDATE: A helpful Twitter user noted that I can remove the :05 in the output field of the scatter rule and the rule works. This is great, and it happens to solve my original problem, but only because DeepVariant is tolerant of zero-padding for the shard parameter passed at the command line. Is there a solution that allows me to apply formatting to a wildcard?