3

I was wondering if there is a way to have optional inputs in rules. An example case is excluding unpaired reads for alignment (or having only unpaired reads). A pseudo rule example:

rule hisat2_align:
    input:
        rU: lambda wildcards: ('-U '+ read_files[wildcards.reads]['unpaired']) if wildcards.read_type=='trimmed' else '',
        r1: lambda wildcards: '-1 '+ read_files[wildcards.reads]['R1'],
        r2: lambda wildcards: '-2 '+ read_files[wildcards.reads]['R2']
    output:
        'aligned.sam'
    params:
        idx: 'index_prefix',
        extra: ''
    shell:
        'hisat2 {params.extra} -x {params.idx} {input.rU} {input.r1} {input.r2}'

Here, not having trimmed reads (rU='') would result in missing input file error. I can go around this through a duplicate rule with adjusted input/shell statement or handling the input through params (i'm sure there are other ways). I'm trying to handle this neatly so that this step can be run through a snakemake wrapper (currently a custom one).

The closest example I've seen is on https://groups.google.com/d/msg/snakemake/qX7RfXDTDe4/XTMOoJpMAAAJ and Johannes' answer. But there we have a conditional assignment (eg. input: 'a' if condition else 'b') not an optional one.

Any help/guidance will be appreciated.

ps. optional input can help with varying number of hisat2 indexes as well (as noted here: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/hisat2.html).

EDIT

To clarify the potential inputs:

1) Use single-end reads alone and declare them in rU. Reads files for the sample might be

sample1_single_1.fastq.gz
sample1_single_2.fastq.gz

In this case r1 and r2 maybe empty lists or not declared at all in the rule.

2) Use paired-end reads and declared them in r1 and r2. Reads files for the sample might be

sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz

In this case `rU`` maybe empty list or not declared at all in the rule.

3) Use paired and single-end reads together (e.g. output from trimmomatic where some pairs are broken). Reads files for the sample might be

sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz
sample1_unpaired_1_R1.fastq.gz
sample1_unpaired_1_R2.fastq.gz
sample1_unpaired_2_R1.fastq.gz
sample1_unpaired_2_R2.fastq.gz

As a solution. I ended up using @timofeyprodanov approach. I didn't realize an empty list can be used for this. Thanks for all the answers and comments!

meono
  • 58
  • 1
  • 6

4 Answers4

3

One solution would be to pass the endedness info through the output filename or path. Something like the following should work with the existing wrapper:

def get_fastq_reads(wcs):
    if wcs.endedness == 'PE':  # Paired-end
        return ["reads/{wcs.sample}.1.fastq.gz", "reads/{wcs.sample}.2.fastq.gz"]

    if wcs.endedness == 'SE':  # Single-end
        return ["reads/{wcs.sample}.fastq.gz"]

    raise(ValueError("Unrecognized wildcard value for 'endedness': %s" % wcs.endedness))

rule hisat2:
    input:
      reads=get_fastq_reads
    output:
      "mapped/{sample}.{endedness}.bam"
    log:                                # optional
      "logs/hisat2/{sample}.{endedness}.log"
    params:                             # idx is required, extra is optional
      idx="genome",
      extra="--min-intronlen 1000"
    wildcard_constraints:
        endedness="(SE|PE)"
    threads: 8                          # optional, defaults to 1
    wrapper:
      "0.27.1/bio/hisat2"

With this single rule, one could then map reads/tardigrade.fastq.gz with

> snakemake mapped/tardigrade.SE.bam

or reads/tardigrade.{1,2}.fastq.gz with

> snakemake mapped/tardigrade.PE.bam

Note on the Index Note

I'm confused by the note on the index files, and think it may be wrong. HISAT2 doesn't accept files for that argument, but instead a single prefix that all index files have in common, so there should be only ever one argument for that. The example, idx="genome.fa", in the documentation is misleading. The index that results from building the toy reference (22_20-21M.fa) that comes with HISAT2 is 22_20-21M_snp.{1..8}.ht2, in which case one would use idx="22_20-21M_snp".

merv
  • 67,214
  • 13
  • 180
  • 245
  • thanks for the answer. I should have been more clear as I also want to be able to use PE and SE together (e.g. filtered/trimmed reads). I don't think the current wrapper supports that (although hisat2 does). But I get how you think, and already applied a similar function to deal with this. on the note on the index note: You are right, hisat2 only takes the prefix as the argument. I want the index files in the input so that a build_index rule is triggered in the pipeline. – meono Jul 20 '18 at 14:22
  • @meono Please consider adding to your original question then, because it still isn't clear what you're saying. Give an example of "use PE and SE together" - I presume you don't just mean `snakemake mapped/tardigrade.{SE,PE}.bam`. Or are you trying to share files? - that's simply a matter of adjusting the output of `get_fastq_reads`. A good example would show some example input files, a snakemake invocation, and the expected output. – merv Jul 20 '18 at 16:32
  • thanks again @merv. I hope the edit makes the question cleared. Timofey's answer does the job for me. – meono Aug 06 '18 at 08:28
3

I usually do it using expand with either empty or non-empty list:

rule a:
    input:
        expand('filename', proxy=[] if no_input else [None])
2

In order to create a conditional input in snakemake and avoid a missing input file error you can use [] instead of "".

rule example:
    input:
        in1="this_input_is_always_there.json",
        in2="this_input_is_conditional.json" if condition else [],

In your example, by replacing

rU: lambda wildcards: ('-U '+ read_files[wildcards.reads]['unpaired']) if wildcards.read_type=='trimmed' else '',

with

rU: lambda wildcards: ('-U '+ read_files[wildcards.reads]['unpaired']) if wildcards.read_type=='trimmed' else [],

you achieve the conditionality of the input without having to use expand or a additional function.

0

I think mere's approach, i.e. to let the output file name contain the information about the endedness, is the most natural one in Snakemake. An alternative way that requires you to have duplicate rules but not conditional statements is to use the ruleorder directive, e.g. ruleorder: align_pe > align_se. Then the higher priority rule will be used if the optional input exists.

Rasmus
  • 69
  • 7