Imagine you have a workflow with a wildcard (`wc` in the example below) and you want to run it for a large number of different values of that wildcard (e.g. 1000 samples). Typically I would create a `rule all` that takes as input a function generating the 1000 filenames. But what I've found is that Snakemake will execute rule `one` 1000 times, then rule `two` 1000 times. This is problematic if the intermediate file produced by `two` is very large, because you end up with 1000 huge files.
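For concreteness, a minimal sketch of such a driver rule, assuming the wildcard values are simply 1..1000 (in reality a function would generate the filenames):

```
# Hypothetical driver rule: expand() stands in for the function
# that generates the 1000 target filenames.
rule all:
    input: expand("five_{wc}.txt", wc=range(1, 1001))
```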
Instead, what I would like is for Snakemake to produce `five_1.txt` ... `five_1000.txt` one at a time, ensuring that it actually produces one output of `rule all` before moving on to the next. This way, `temp()` deletes one `three_{wc}.txt` before the next is produced and you don't end up with a large number of large files.
In a linear workflow, you can use priorities as suggested by @Maarten-vd-Sande. This works because Snakemake looks at the jobs it can currently run and picks the highest-priority one, which in a linear chain is always the job furthest down the chain. In a fork, however, this breaks down: both sides of the fork would need the same priority, and then Snakemake just runs all instances of one rule first.
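To illustrate the linear case, here is a minimal sketch (rule and file names are hypothetical, not from my actual workflow):

```
# In a linear chain, giving the downstream rule a higher priority
# makes Snakemake finish one sample's chain before starting the
# next sample's first step, so each temp() file is deleted promptly.
rule step1:
    input: "input_{wc}.txt"
    output: temp("mid_{wc}.txt")
    priority: 1
    shell: "touch {output}"

rule step2:
    input: "mid_{wc}.txt"
    output: "final_{wc}.txt"
    priority: 2  # higher priority: runs as soon as its input exists
    shell: "touch {output}"
```

For reference, here is the forked workflow in question: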
```
rule one:
    input: "input_{wc}.txt"
    output:
        touch("one_{wc}.txt"),
        touch("two_{wc}.txt")

rule two:
    input: "one_{wc}.txt"
    output: temp(touch("three_{wc}.txt"))

rule three:
    input: "two_{wc}.txt"
    output: touch("four_{wc}.txt")

rule four:
    input:
        "three_{wc}.txt",
        "four_{wc}.txt"
    output: "five_{wc}.txt"
    shell:
        """
        touch {output}
        """
```