0

this is a slightly more complicated case than the one reported here. My input files are the following:

ont_Hv1_2.5+.fastq                                                                                                                                                                              
ont_Hv2_2.5+.fastq                                                                                                                                                                              
pacBio_Hv1_1-1.5.fastq                                                                                                                                                                          
pacBio_Hv1_1.5-2.5.fastq                                                                                                                                                                        
pacBio_Hv1_2.5+.fastq                                                                                                                                                                           
pacBio_Hv2_1-1.5.fastq                                                                                                                                                                          
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq

I would like to process only existing input files, i.e. automatically skip those wildcard combinations that correspond to non-existing input files.

My Snakefile looks like this:

import glob
import os.path
from itertools import product

#make wildcards regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+",

# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)

# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
    if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
        tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)

# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            for ac in AUTHORIZEDCOMBINATIONS:
                if(wc_comb[0:3] == ac):
                    print ("SUCCESS")
                    yield(wc_comb)
                    break
    return filtered_combinator

filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)

rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)

#get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field

#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"

Rule getReadLength collects read lengths for each input FASTQ (i.e. for each techname, capDesign, sizeFrac combo).

Rule aggReadLength merges read length statistics generated by getReadLength, for each techname, capDesign combo.

The workflow fails with the following message:

Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq

So it seems that the wildcard combination filtering step applied to the target is not propagated to all upstream rules it depends on. Anyone knows how to make it so?

(Using Snakemake version 4.4.0.)

Thanks a lot in advance

JulienLag
  • 151
  • 1
  • 9

1 Answers1

1

I think I solved the problem, hopefully this will be useful to someone else.

In the aggReadlength rule, I replaced

input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)

with

input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)
JulienLag
  • 151
  • 1
  • 9
  • Thanks for the self reply! I think I have a similar issue, however I couldnt understand how lambda fixed it. My problem is in here : https://stackoverflow.com/questions/48443572/using-multiple-filenames-as-wildcards-in-snakemake. Do you think I can apply the same solution as you did? Thanks again! – bapors Jan 25 '18 at 13:31