
Imagine you have a workflow with a wildcard (`wc` in the example below) and you want to run it for a large number of different values of that wildcard (e.g. 1000 samples). Typically I would create a rule `all` that takes as input a function generating the 1000 filenames. But what I've found is that Snakemake will execute rule `one` 1000 times, then rule `two` 1000 times. This is problematic if the intermediate file produced by `two` is very large, because you end up with 1000 huge files on disk at once.

Instead, what I would like is for Snakemake to produce `five_1.txt` ... `five_1000.txt` one sample at a time, ensuring that it actually produces one output of rule `all` before moving on to the next. This way, `temp()` deletes each `three_{wc}.txt` before the next one is produced and you don't end up with a large number of large files.
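For reference, the usual `rule all` pattern described above might look like this (the `range(1, 1001)` sample list is an assumption for illustration; it matches the `five_1.txt` ... `five_1000.txt` targets mentioned):

```
rule all:
  input: expand("five_{wc}.txt", wc=range(1, 1001))
```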

In a linear workflow, you can use priorities as suggested by @Maarten-vd-Sande. This works because Snakemake looks at the jobs it can currently run and picks the highest-priority one, which in a linear workflow is always the one furthest down the chain. In a fork, however, this doesn't work: both sides of the fork need the same priority, and then Snakemake simply runs all instances of one rule first.

rule one:
  input: "input_{wc}.txt"
  output: 
    touch("one_{wc}.txt"),
    touch("two_{wc}.txt")
  
rule two:
  input: "one_{wc}.txt"
  output: temp(touch("three_{wc}.txt"))

rule three:
  input: "two_{wc}.txt"
  output: touch("four_{wc}.txt")

rule four:
  input:
    "three_{wc}.txt",
    "four_{wc}.txt"
  output: "five_{wc}.txt"
  shell:
    """
    touch {output}
    """
silver arrow
    Does this answer your question? [Snakemake: Tranverse DAG depth-first?](https://stackoverflow.com/questions/64173399/snakemake-tranverse-dag-depth-first) – Maarten-vd-Sande Feb 01 '21 at 20:22
    I just found out that `disk_mb` is a [standard resource](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#standard-resources). You could use that as well to limit the max storage use – Maarten-vd-Sande Feb 04 '21 at 12:44
  • @Maarten-vd-Sande Your solution using priorities would work for the example I gave, so thank you. :) But actually, it doesn't work for my actual workflow because it has a fork in it, whoops. Let me see if I can rewrite this question. – silver arrow Feb 04 '21 at 15:54
  • Not sure if the resources will actually affect the order of rules being executed, will it not just stop anything being executed once the limit is exceeded? – silver arrow Feb 04 '21 at 16:11
  • Never mind, I just made my workflow not a fork by adding a dependency between 2 and 3. It's late :') – silver arrow Feb 04 '21 at 16:18
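As a sketch of the `disk_mb` idea from the comments (the specific sizes here are assumptions): if each rule declares roughly how much disk its output needs, running Snakemake with a global `--resources` cap keeps it from launching more concurrent jobs than the stated limit allows, which together with `temp()` bounds peak storage use:

```
rule two:
  input: "one_{wc}.txt"
  output: temp(touch("three_{wc}.txt"))
  resources: disk_mb=50000
```

Invoked with e.g. `snakemake --resources disk_mb=100000`, at most two such intermediates could exist at once. Note this limits concurrency rather than directly forcing a depth-first order.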

1 Answer


If it's a fork, just make it not a fork: add a dependency so that `two` has to execute before `three`, which makes the workflow linear. Then use @Maarten-vd-Sande's priority solution.

rule one:
  input: "input_{wc}.txt"
  output: 
    touch("one_{wc}.txt"),
    touch("two_{wc}.txt")
  
rule two:
  input: "one_{wc}.txt"
  output: temp(touch("three_{wc}.txt"))
  priority: 1

rule three:
  input: 
    "two_{wc}.txt", 
    "one_{wc}.txt"
  output: touch("four_{wc}.txt")
  priority: 2

rule four:
  input:
    "three_{wc}.txt",
    "four_{wc}.txt"
  output: touch("five_{wc}.txt")
  priority: 3
silver arrow