
I'm having some trouble with Snakemake; so far I haven't found relevant information in the documentation (or elsewhere). I have a large file with different samples (multiplexed analyses) and I would like to stop the execution of the pipeline for some samples, depending on results found after certain rules.

I have already tried changing this value outside of a rule definition (using a checkpoint or a def), making the input of the following rules conditional, and treating the wildcards as a simple list from which one item can be deleted. Below is an example of what I want to do (the conditional if is only indicative here):

# Import the config file(s)
configfile: "../PATH/configfile.yaml"

# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]

# Rules

rule all:
    input:
        expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)


rule one:
    input:
        "PATH_TO_INPUT/{lauch}.{sample}.input"
    output:
        temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
    shell:
        """
        somescript.sh {input} {output}
        """

rule two:
    input:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
    output:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output"
    shell:
        """
        somecheckpoint.sh {input}       # Prints a message and writes to the log file for now

        if [ file_dont_pass_checkpoint ]; then
            # Remove the sample corresponding to the wildcard {sample}
            # so that the analysis continues only with samples that pass the validation
        fi


        somescript2.sh {input} {output}
        """

If someone has an idea, I'm interested. Thank you in advance for your answers.

Simon
  • So if your sample passes your checkpoint, the rule should not produce an output file and exit successfully? – Manavalan Gajapathy May 07 '19 at 15:59
  • I would rather say that if my sample passes the checkpoint, the analysis continues normally. But if my sample does not pass the checkpoint, the analysis stops (does not produce an output file) and exits successfully. – Simon May 09 '19 at 13:01

2 Answers


I think this is an interesting situation if I understand it correctly. If a sample passes some checks, then keep analysing it. Otherwise, stop early.

At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output, since this is what rule all asks for regardless of the check results.

You could have the rule(s) performing the checks write a file containing a flag that indicates whether the checks passed for that sample (say, PASS or FAIL). Then, according to that flag, the rule(s) doing the analysis either run the full analysis (if PASS) or write an empty file (or whatever) if the flag is FAIL. Here's the gist:

rule all:
    input:
        expand('{sample}.output', sample= samples),

rule checker:
    input:
        '{sample}.input',
    output:
        '{sample}.check',
    shell:
        r"""
        if [ some_check_is_ok ]
        then
            echo "PASS" > {output}
        else
            echo "FAIL" > {output}
        fi
        """

rule do_analysis:
    input:
        chk= '{sample}.check',
        smp= '{sample}.input',
    output:
        '{sample}.output',
    shell:
        r"""
        if grep -q "PASS" {input.chk}
        then
            do_long_analysis.sh {input.smp} > {output}
        else
            > {output} # Do nothing: write an empty file
        fi
        """

If you don't want to see the failed, empty output files at all, you could use the onsuccess directive to get rid of them at the end of the pipeline:

onsuccess:
    import os
    for x in expand('{sample}.output', sample= samples):
        if os.path.getsize(x) == 0:
            print('Removing failed sample %s' % x)
            os.remove(x)
dariober
  • Thanks a lot for your answer. I had already considered this solution, but I have a lot of checkpoints during the execution of my pipeline, so it implies continuing the analyses with empty files (if one sample doesn't pass one of the first checkpoints). This can be problematic for some rules at the end of the pipe. Do you know a way to avoid this problem while also avoiding wasting time analysing bad-quality samples? – Simon May 09 '19 at 13:00
  • @Simon- Have you checked this [question](https://stackoverflow.com/q/45613881/3998252)? I'm not certain how well it applies to your scenario. Given Snakemake's requirement to know the files-to-be-created and their rules beforehand, I'm not certain there is a better solution than the above answer and the linked one. – Manavalan Gajapathy May 10 '19 at 02:24
  • @Simon The main problem I see with my solution is that you have to repeat the if-else block in every rule that, like `do_analysis`, is conditional on the sample being PASS or FAIL. This can be cumbersome to code and there may be better strategies, but in terms of computation it is very lightweight. Maybe you should explain what makes it *problematic for some rules at the end of the pipe* – dariober May 10 '19 at 08:55
  • Thanks @JeeYern for this link, I didn't see this page during my search on the subject. I'm going to test this solution. I could maybe try to skip all the rules for a sample after a failed checkpoint (with some conditional checks in the following rules) and, at the end, write an empty file for the "rule all", as in dariober's solution. – Simon May 13 '19 at 15:55
  • @Dariober Yes, you're right, your solution seems very fast to run. What I mean by "problematic for some rules at the end of the pipe" is that I have some rules in my pipeline which do not accept an empty file as input. I can try to add a conditional like: IF [ NOT PASS || EMPTY FILE ]; then ... So your solution seems to be the best for now (I tried it this weekend) and it seems to run. Thank you very much. – Simon May 13 '19 at 15:55
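
For illustration only, the guard Simon describes in the last comment could look like the sketch below; the rule name, file suffixes and heavy_step.sh are hypothetical placeholders, not part of the original pipeline. The idea is to run the expensive step only when the flag file from the checker rule contains PASS and the intermediate input is non-empty, and otherwise to propagate an empty placeholder as in the answer above:

rule downstream_step:
    input:
        chk= '{sample}.check',
        smp= '{sample}.intermediate',
    output:
        '{sample}.final',
    shell:
        r"""
        # Run the heavy step only if the flag file says PASS and the
        # intermediate input is non-empty; otherwise write an empty placeholder.
        if grep -q "PASS" {input.chk} && [ -s {input.smp} ]
        then
            heavy_step.sh {input.smp} > {output}
        else
            > {output}
        fi
        """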

The canonical solution to problems like this is to use checkpoints. Consider the following example:

import pandas as pd

def get_results(wildcards):
    qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
    return expand(
        "results/processed/{sample}.txt", 
        sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
    )


rule all:
    input:
        get_results


checkpoint qc:
    input:
        expand("results/preprocessed/{sample}.txt", sample=config["samples"])
    output:
        "results/qc.tsv"
    shell:
        "perfom-qc {input} > {output}"


rule process:
    input:
        "results/preprocessed/{sample}.txt"
    output:
        "results/processed/{sample.txt}"
    shell:
        "process {input} > {output}"

The idea is the following: at some point in your pipeline, after some (let's say) preprocessing, you add a checkpoint rule, which aggregates over all samples and generates some kind of QC table. Then, downstream of that, there is a rule that aggregates over samples (e.g. the rule all, or some other aggregation inside of the workflow). Let's say in that aggregation you only want to consider samples that pass the QC. For that, you let the required files ("results/processed/{sample}.txt") be determined via an input function, which reads the QC table generated by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so that you can actually read the table results and base your decision about the samples on the qc criteria contained in that table. Any intermediate rules (like here the process rule) will then be automatically applied by Snakemake when re-evaluating the DAG.
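
As a side note, get_results above assumes that perform-qc writes a tab-separated table with a sample column and a metric column named some-qc-criterion, and that the config provides samples and qc-threshold. The following stand-alone Python sketch, with purely made-up values, shows the same filtering step that get_results performs:

import pandas as pd

# Purely illustrative values; column names match what get_results expects.
qc = pd.DataFrame({
    "sample": ["A", "B", "C"],
    "some-qc-criterion": [0.95, 0.40, 0.99],
})

threshold = 0.9  # would come from config["qc-threshold"]
passing = qc[qc["some-qc-criterion"] > threshold]["sample"]
print(list(passing))  # ['A', 'C']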

Johannes Köster