Is there a way to prevent the output files defined in a Snakemake rule from being deleted before the shell command is executed? I found a description of this behaviour here: http://snakemake.readthedocs.io/en/stable/project_info/faq.html#can-the-output-of-a-rule-be-a-symlink

What I am trying to do is define a rule for a list of input files and a list of output files (an N:M relation). This rule should be triggered if one of the input files has changed. The Python script called in the shell command then creates only those output files that do not exist yet or whose content differs from the already existing files (i.e. the change detection is implemented inside the Python script). I expected something like the following rule to solve this, but as the output jsons are deleted before the Python script runs, all output jsons are created with a new timestamp instead of only those that have changed.

rule jsons:
    "Create transformation files out of landmark correspondences."
    input:
        matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
    output:
        jsons = ["transformation/{section}_transformation.json".format(section=s) for s in SECTIONS]
    shell:
        "python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {output.jsons}"

If there is no way to avoid the deletion of output files in Snakemake, does anybody have another idea how to map this workflow onto a Snakemake rule without updating all output files?

Update:

I tried to solve this problem by changing the Snakemake source code. I removed the line self.remove_existing_output() in jobs.py to avoid removing output files before executing a rule. Furthermore, I added the parameter no_touch=True to the self.dag.check_and_touch_output() call in executors.handle_job_success. This worked great, as the output files were now neither removed before nor touched after the rule was executed. But subsequent rules that take json files as input are still triggered for each json file (even if it did not change), as Snakemake recognizes that the json file was declared as an output before and therefore assumes it must have changed. So I think avoiding the deletion of output files does not solve my problem; a workaround, if one exists, is probably the only way...
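
A dry run with the reason flag makes this trigger behaviour visible; assuming a Snakemake version that supports -n/--dryrun and -r/--reason, each scheduled job is printed together with why it would run:

$ snakemake -n -r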

Update 2:

I also tried to find a workaround that does not require changing the Snakemake source code, by changing the output path of the jsons rule defined above to transformation/tmp/... and adding the following rule:

import filecmp, os

def cmp_jsons(wildcards):
    # compare the json for the given section in transformation/ with the one in transformation/tmp/;
    # return [] if the json did not change, else the path to the tmp json
    tmp = "transformation/tmp/B21_%s_affine_transformation.json" % wildcards.section
    final = "transformation/B21_%s_affine_transformation.json" % wildcards.section
    if os.path.exists(final) and filecmp.cmp(tmp, final, shallow=False):
        return []
    return [tmp]
rule copy:
    input:
        json_tmp = cmp_jsons
    output:
        jsonfile = "transformation/B21_{section,\d+}_affine_transformation.json"
    shell:
        "cp {input.json_tmp} {output.jsonfile}"

But as the input function is evaluated before the workflow starts, the tmp jsons either do not exist yet or have not yet been updated by the jsons rule, so the comparison won't be correct.

SarahH

2 Answers


I do not think Snakemake currently has a solution to your problem. You would have to pull the input/output logic out of create_transformation_jsons.py and write separate rules for each relation in the Snakefile. It might be helpful for you to know that anonymous rules can be generated, e.g. inside a for loop; see How to deal with a variable number of output files in a rule.
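
For example, something along these lines; the section-to-matchfile mapping below is a placeholder, since only you know the real N:M relation:

# one anonymous rule per section pair, generated in a plain Python loop
for i in range(len(SECTIONS) - 1):
    rule:
        input:
            "matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1])
        output:
            "transformation/{}_transformation.json".format(SECTIONS[i])
        shell:
            "python create_transformation_jsons.py --matchfiles {input} --outfiles {output}"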

Recently Snakemake started clearing log files when executing a rule, and I have opened an issue on that. A solution to that problem could possibly help you too. But that is all in the uncertain future, so don't count on it.


Update

Here is another approach. You do not have any wildcards in your rule, so I assume you are only running the rule once. I also assume that at the time of execution you can make a list of the sections that are being updated; I've called this list SECTIONS_PRUNED. Then you can make a rule that only declares these files as output files.

rule jsons:
    "Create transformation files out of landmark correspondences."
    input:
        matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
    output:
        jsons = ["transformation/{section}_transformation.json".format(section=s) for s in SECTIONS_PRUNED]
    params:
        jsons = [f"transformation/{s}_transformation.json" for s in SECTIONS]
    run:
        shell("python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}")

I initially thought it would be a good idea to use shadow: "minimal" to ensure that any files SECTIONS_PRUNED fails to declare are not spuriously updated. However, the case with shadow might be worse: missed files would be updated, left behind in the shadow directory, and deleted unnoticed. With shadow you would also need to copy the json files into the shadow directory to let your script figure out what to generate.

So the better solution is probably not to use shadow. If SECTIONS_PRUNED fails to declare all the files that are updated, a second execution of Snakemake will highlight (and fix) this and ensure all downstream analyses complete correctly.
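
In practice that just means invoking Snakemake twice; the second pass sees the newer timestamps of any updated-but-undeclared jsons and reruns the affected downstream rules:

$ snakemake    # runs jsons with the SECTIONS_PRUNED outputs
$ snakemake    # picks up anything the first pass updated but did not declare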


Update 2

Yet another, and simpler, approach would be to split your workflow in two, by not letting Snakemake know that the jsons rule produces output files.

rule jsons:
    "Create transformation files out of landmark correspondences."
    input:
        matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
    params:
        jsons = [f"transformation/{s}_transformation.json" for s in SECTIONS]
    shell:
        "python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}"

Run snakemake in two parts, replacing all with the relevant rule name.

$ snakemake jsons
$ snakemake all
Foldager
  • Thanks for your answer. I think the problem in my Snakemake workflow is that I don't know which json files will be changed in the rule jsons, so I don't know how to define anonymous rules in a for loop which is interpreted before the workflow starts and before I know for which output files an anonymous rule is required. Do you have a concrete code example? Solving the linked Snakemake issue really could be a solution for my problem, but it should be considered that not only removing the output files is a problem but also triggering following rules (see update in my question). – SarahH Mar 14 '18 at 08:25
  • I have updated my answer. However, you will need a way to figure out what files you are updating. Can't you take the code for that from `create_transformation_jsons.py`? – Foldager Mar 14 '18 at 20:29
  • You are right, I'm only running the rule once, but I still have no idea how to define a list `SECTIONS_PRUNED` at runtime. As far as I know, all Python definitions are evaluated before the workflow starts, but I need to run a prior rule in order to know which jsons will be updated. I have already tried something similar (see my 2nd update). Your 2nd update is the best I have so far, thanks! I tried to split the workflow before but used output instead of params. But this only worked with the Snakemake source code changes. However, a solution with only one Snakemake run would be the best. – SarahH Mar 16 '18 at 08:51
  • @SarahH, do you have a way of figuring out which jsons files will be updated given which match files will be created/updated? Then I can help you. – Foldager Mar 16 '18 at 15:06
  • Yes, the list of json files to update depends on the list of updated/created match files. It is not straightforward to derive the jsons from the match files, but it should be possible. – SarahH Mar 19 '18 at 08:19

This is a bit more involved, but I think it would work seamlessly for you.

The solution involves calling Snakemake twice, but you can wrap it up in a shell script. In the first call you use Snakemake in --dryrun mode to figure out which jsons will be updated, and in the second call this information is used to build the DAG. I use --config to switch between the two modes. Here is the Snakefile:

def get_match_files(wildcards):
    """Used by jsons_fake to figure out which match files each json file depends on"""
    section = wildcards.section

    ### Do stuff to figure out which matching files this json depends on
    # YOUR CODE GOES HERE
    idx = SECTIONS.index(int(section))  # I have no idea if this is what you need
    matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[idx], SECTIONS[idx + 1])]

    return matchfiles

def get_json_output_files(fn):
    """Used by jsons. Read which json files will be updated from fn"""
    try:
        json_files = []
        with open(fn, 'r') as fh:
            for line in fh:
                if not line.strip():
                    continue  # skip blank lines ("\n" alone is not falsy, hence the strip)
                split_line = line.split(maxsplit=1)
                if split_line[0] == "output:":
                    json_files.append(split_line[1].rstrip())  # strip the trailing newline; assumes only 1 output file per line
    except FileNotFoundError:
        print(f"Warning, could not find {fn}. Updating all json files.")
        json_files = expand("transformation/{section}_transformation.json", section=SECTIONS)

    return json_files


if "configuration_run" in config:
    rule jsons_fake:
        "Fake rule used for figuring out which json files will be created."
        input:
            get_match_files
        output:
            jsons = "transformation/{section}_transformation.json"
        run:
            raise NotImplementedError("This rule is not meant to be executed")

    rule jsons_all:
        input: expand("transformation/{s}_transformation.json", s=SECTIONS)

else:
    rule jsons:
        "Create transformation files out of landmark correspondences."
        input:
            matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
        output:
            jsons = get_json_output_files('json_dryrun') # This is called at rule creation
        params:
            jsons = expand("transformation/{s}_transformation.json", s=SECTIONS)
        run:
            shell("python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}")

To avoid calling Snakemake twice manually, you can wrap the two calls in a shell script, mysnakemake:

#!/usr/bin/env bash

snakemake jsons_all --dryrun --config configuration_run=yes | grep -A 2 'jsons_fake:' > json_dryrun
snakemake "$@"

And call the script like you would normally call snakemake, e.g.: mysnakemake all -j 2. Does this work for you? I haven't tested all parts of the code, so take it with a grain of salt.
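
One practical detail: the wrapper has to be made executable once before it can be called like that:

$ chmod +x mysnakemake
$ ./mysnakemake all -j 2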

Foldager
  • This works great! Thank you very much! Although Snakemake must be called twice, it is definitively better than calling two different rules with a long execution time between both calls. I just had to add `.rstrip()` when adding `split_line[1]` to the list of json files. Apart from that your code was running immediately after I implemented `get_match_files()`. – SarahH Mar 21 '18 at 08:26