Handling equivalent file extensions in Snakemake

Question

Essentially, I want to know what the recomended way of handling equivalent file extensions is in snakemake. For example, lets say I have a rule that counts the number of entries in a fasta file. The rule might look something like....

rule count_entries:
    input:
        ["{some}.fasta"]
    output:
        ["{some}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'

This works great. But what if I want this rule to also permit "{some}.fa" as input?

Is there any clean way to do this?

EDIT:

Here is my best guess at the first proposed sollution. This can probably be turned into a higher order function to be more general purpose but this is the basic idea as I understand it. I don't think this idea really fits any general use case though as it doesn't cooperate with other rules at the "building DAG" stage.

import os

def handle_ext(wcs):
    base = wcs["base"]
    for file_ext in [".fasta", ".fa"]:
        if(os.path.exists(base + file_ext)):
            return [base + file_ext]

rule count_entries:
    input:
        handle_ext
    output:
        ["{base}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'

EDIT2: Here is the best current sollution as I see it...

count_entries_cmd = 'grep -c ">" {input} > {output}'
count_entries_output = "{some}.entry_count"

rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        count_entries_output
    shell:
        count_entries_cmd

rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        count_entries_output
    shell:
        count_entries_cmd

Just using a python function may be the best possible solution. I don't believe snakemake has any in-built solution for this. — Manavalan Gajapathy, May 30 '19 at 18:38
See this for using [function as input files](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files). — Manavalan Gajapathy, May 30 '19 at 19:29
See my edit in OP for an example that fits this use case. It still seems pretty ugly and awkward. Also it doesn't really work with other rules well or at all really as it only knows whether or not a .fasta is the correct input once the file exists. — Nolan Hartwick, May 30 '19 at 19:33

Dmitry Kuzminov · Answer 1 · 2019-06-01T17:09:26.123

1

One thing I noticed is that you are trying to specify lists of files in both input and output sections but actually your rule takes a single file and produces another file. I propose you a straightforward solution of specifying two separate rules for different extensions:

rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'

rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'

These rules are not ambiguous unless you keep files with the same {some} name and different extension in the same folder (which I hope you don't do).

edited Jun 01 '19 at 17:09

answered Jun 01 '19 at 15:55

Dmitry Kuzminov

6,180
6
18
40

1

This seems like the best sollution. My only reservation is that it contains a lot of duplicated code. This can be streamlined in obvious ways to some extent by pulling out the output and shell strings as variables and just using those variables in the rule definitions. Ideally these rules could be generated instead of requiring manual duplication of code but I'm not sure this is possible. Can rules be generated by a function? if so how? – Nolan Hartwick Jun 03 '19 at 18:14
@NolanHartwick You don't need to duplicate code. For example you may make shell contents a separate function. The rest is just minimal code to express your idea: take the output, match it with the input. – Dmitry Kuzminov Jun 03 '19 at 18:27
1

The rule definitions themselves are duplicated code. It may be unavoidable and it means this sollution doesn't generalize well to more file types. It doesn't look too bad to have two versions of the same rule. It would look a lot worse if there were four, or in potential extreme cases many more. In these (rare I admit) cases, it would be great if rules could be created by a function call or something so that you could just specify the base rule and the set of equivalent input and then create all the rules that are needed. I don't think this is possible though. – Nolan Hartwick Jun 03 '19 at 18:41

merv · Answer 2 · 2019-06-01T17:58:08.927

0

One possible solution is to only allow the original rule to take .fasta files as input, but enable .fa files to be renamed to that. For example,

rule fa_to_fasta:
    input:
        "{some}.fa"
    output:
        temp("{some}.fasta")
    shell:
        """
        cp {input} {output}
        """

Clearly this has the disadvantage of making a temporary copy of the file. Also, if foo.fa and foo.fasta are both provided (not through the copying), then foo.fasta will silently overshadow foo.fa, even if they are different.

edited Jun 01 '19 at 17:58

answered Jun 01 '19 at 03:19

merv

67,214
13
180
245

1

Even if that is not good, that happens a lot that there could be different ways to create the same file. You can always resolve the ambiguity explicitly. Moving the source file is much worse solution anyway. – Dmitry Kuzminov Jun 01 '19 at 16:06
Moreover there is no ambiguity if there is either `fasta` of `fa` file in the folder. The real ambiguity is when you have both at the same time, and in this case changing the file name doesn't work either. – Dmitry Kuzminov Jun 01 '19 at 16:21
@DmitryKuzminov While I concede that Snakemake will only *raise* the ambiguity exception in the case where both potential inputs are present, the ambiguity I'm considering is the formal one that is necessary to avoid in order to have a reproducible pipeline. Effectively, Snakemake rules define a map from an output file to a poset of input files. If each output file has a single determinable poset, then the pipeline as a whole is unambiguous. Just because Snakemake doesn't test for formal reproducibility doesn't mean that one shouldn't be aiming for it. – merv Jun 01 '19 at 17:08
1

To start with, Snakemake allows you to explicitly resolve the ambiguity using the `ruleorder` statement. That works in case there is a real ambiguity when Snakemake fails to decide which rule it should use (thus produces the `AmbiguousRuleException`). But in the example with fasta/fa there is no ambiguity unless there are files with exactly the same name but different extensions (which I consider to be a more severe logical problem then just having an ambiguity). But moving the files in this case doesn't help either. – Dmitry Kuzminov Jun 01 '19 at 17:16
@DmitryKuzminov good point. I see now that one would actually *want* to trigger the ambiguity exception if there were two files with the same base, because it would warn the user that there is a real logical problem with their input. In my suggested solution, it would silently ignore the `.fa` file even if there was a real difference between the two. Thanks for your input! – merv Jun 01 '19 at 17:25
As for your solution I would take it as acceptable in case of some modifications. The main point: the source shall not be changed. At the end of the day you may have no permissions to move the file. This however would give you a chance to copy the .fa file into a temporary folder with the other extension. Even that could be disadvantageous in case the source file changes, but flagging the newly created fasta as temporary solves the issue. – Dmitry Kuzminov Jun 01 '19 at 17:30

Oliver Küchler · Answer 3 · 2022-08-28T19:52:21.977

Even though OP has edited his entry and included the possible workaround via the input functions, I think it is best to list it also here as an answer to highlight this as possible solution. At least for me, this was the case :)

So, for example if you have an annotation table for your samples, which includes the respective extensions for each sample-file (e.g. via PEP), then you can create a function that returns these entries and pass this function as input to a rule. My example:

# Function indicates needed input files, based on given wildcards (here: sample) and sample annotations
# In my case the sample annotations were provided via PEP
def get_files_dynamically(wildcards):
    sample_file1 = pep.sample_table["file1"][wildcards.sample]
    sample_read2 = pep.sample_table["file"][wildcards.sample]

    return {"file1": sample_file1, "file2": sample_file2}


# 1. Perform trimming on fastq-files
rule run_rule1:
    input:
        unpack(get_files_dynamically) # Unpacking allows naming the inputs
    output:
        output1="output/somewhere/{sample}_1.xyz.gz",
        output2="output/somewhere/{sample}_2.xyz.gz"
    shell:
        "do something..."

Yes, thanks for your comment. I fixed it in my answer :) – Oliver Küchler Aug 28 '22 at 19:52 — Oliver Küchler, Aug 28 '22 at 19:52

Handling equivalent file extensions in Snakemake

3 Answers3