
I'm using Snakemake v5.4.0, and I'm running into a problem with temp(). Consider a hypothetical scenario:

Rule A --> Rule B1 --> Rule C1
     |
      --> Rule B2 --> Rule C2 

where Rule A generates temp() files used by both pathways 1 (B1 + C1) and 2 (B2 + C2).

If I run the pipeline, the temp() files generated by Rule A are deleted after they have been used by both pathways, which is what I expect. However, if I then want to re-run Pathway 2, the temp() files from Rule A must be recreated, which triggers a re-run of the entire pipeline, not just Pathway 2. This becomes very computationally expensive for long pipelines. Is there a good way to prevent this, besides not using temp(), which in my case would require many TB of extra hard drive space?
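For concreteness, a minimal Snakefile with this shape (rule and file names made up, Pathway 2 mirrors Pathway 1) could look like:

```
rule all:
    input:
        'C1.out', 'C2.out'

rule A:
    output:
        temp('a.out')  # shared temp input for both pathways
    shell:
        'touch {output}'

rule B1:
    input:
        'a.out'
    output:
        'B1.out'
    shell:
        'touch {output}'

rule C1:
    input:
        'B1.out'
    output:
        'C1.out'
    shell:
        'touch {output}'

# Rules B2 and C2 mirror B1 and C1 with 'B2.out' and 'C2.out'
```

After the first complete run, a.out is deleted; asking for C2.out again then forces rule A (and everything downstream of it) to run again.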

sharchaea

1 Answer


You could build the list of input files to rule all (or whatever your first rule is called) dynamically, depending on whether the output of Pathway 2 already exists (and passes some sanity checks).

import os

output = ['P1.out']
if not os.path.exists('P2.out'): # Some more conditions here...
    output.append('P2.out')

rule all:
    input:
        output

rule make_tmp:
    output:
        temp('a.out')
    shell:
        r"""
        touch {output}
        """

rule make_P1:
    input:
        'a.out'
    output:
        'P1.out'
    shell:
        r"""
        touch {output}
        """

rule make_P2:
    input:
        'a.out'
    output:
        'P2.out'
    shell:
        r"""
        touch {output}
        """

However, this somewhat defeats the point of using snakemake. If the input of Pathway 1 has to be recreated, how can you be sure that its output is still up-to-date?

dariober