This is a slightly more complicated case than the one reported here. My input files are the following:
ont_Hv1_2.5+.fastq
ont_Hv2_2.5+.fastq
pacBio_Hv1_1-1.5.fastq
pacBio_Hv1_1.5-2.5.fastq
pacBio_Hv1_2.5+.fastq
pacBio_Hv2_1-1.5.fastq
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq
I would like to process only existing input files, i.e. automatically skip those wildcard combinations that correspond to non-existing input files.
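For example, glob_wildcards collects all values seen for each wildcard independently, so (if I understand its behaviour correctly) the full product of wildcard values contains combinations such as ('ont', 'Hv1', '1-1.5') even though ont_Hv1_1-1.5.fastq does not exist. Roughly, with the files listed above:

TECHNAMES  = {'ont', 'pacBio'}
CAPDESIGNS = {'Hv1', 'Hv2', 'Mv1'}
SIZEFRACS  = {'1-1.5', '1.5-2.5', '2.5+'}
# full product: 2 * 3 * 3 = 18 combinations, but only 11 FASTQ files exist;
# e.g. ('ont', 'Hv1', '1-1.5') has no matching FASTQ and should be skipped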
My Snakefile looks like this:
import glob
import os.path
from itertools import product

# make wildcard regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+"

# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) from the input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")

# make lists non-redundant:
CAPDESIGNS = set(CAPDESIGNS)
SIZEFRACS = set(SIZEFRACS)
TECHNAMES = set(TECHNAMES)

# make list of authorized wildcard combinations (based on presence of input files):
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES, CAPDESIGNS, SIZEFRACS):
    if os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq"):
        tup = (("techname", comb[0]), ("capDesign", comb[1]), ("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)

# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            for ac in whitelist:
                if wc_comb[0:3] == ac:
                    print("SUCCESS")
                    yield wc_comb
                    break
    return filtered_combinator

filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)
rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)

# get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    # fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in the second field
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s\"\\t\"length($2)}}' > {output}"

# combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"
Rule getReadLength collects read lengths for each input FASTQ (i.e. for each techname/capDesign/sizeFrac combo). Rule aggReadLength merges the read length statistics generated by getReadLength, for each techname/capDesign combo.
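For reference, each *.readlength.tsv produced by getReadLength simply holds the size fraction and the length of each read, one line per FASTQ record, something like (values made up for illustration):

1-1.5	1034
1-1.5	876
1-1.5	1523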
The workflow fails with the following message:
Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq
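If I read my own Snakefile correctly, this happens because the input of aggReadLength expands over all of SIZEFRACS regardless of the filtering done in rule all, so for the target ont_Hv1_all.readlength.tsv it requests roughly:

ont_Hv1_1-1.5.readlength.tsv    # would need ont_Hv1_1-1.5.fastq, which does not exist
ont_Hv1_1.5-2.5.readlength.tsv  # same problem
ont_Hv1_2.5+.readlength.tsv     # fine, ont_Hv1_2.5+.fastq exists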
So it seems that the filtering of wildcard combinations applied to the target is not propagated to the upstream rules it depends on. Does anyone know how to make that happen?
(Using Snakemake version 4.4.0.)
Thanks a lot in advance