Snakemake glob_wildcards with multiple input suffixes

Question

I am wondering if there is a way to define wildcards when the input files are named slightly differently. In this case the FASTQ files have different suffixes: some end with '_L001_R1_001.fastq.gz' and some with 'R1_001.fastq.gz'. I'm hoping to use glob_wildcards to read in the run name and sample name. Is there a good way to express "or" in glob_wildcards? Any suggestions would be fantastic, thank you in advance!!

# Define samples: 
RUNS, SAMPLES = glob_wildcards(config['fastq_dir'] + "{run}/{samp}" + config['fastq1_suffix'])

My config file contains the following:

fastq_dir: 
    '~/tb/data/'
fastq1_suffix:
    '_L001_R1_001.fastq.gz'
fastq2_suffix:  
    '_L001_R2_001.fastq.gz'

First rule:

rule trim_reads:
  input:
    p1=config['fastq_dir'] + '{run}/{samp}' + config['fastq1_suffix'],
    p2=config['fastq_dir'] + '{run}/{samp}' + config['fastq2_suffix']

With Python you have a lot of flexibility if `glob_wildcards` doesn't suffice. In my opinion, though, it's better to put the paths to the FASTQ files in a sample sheet and use that to tell the pipeline where to find the initial input. The problem is that you control the naming of the files your pipeline produces but not the naming of external files, so relying on filename patterns is brittle. The sample sheet will also contain information about the samples; relying on sample names to derive sample characteristics is likewise unreliable. (Comment by dariober, Jan 11 '22 at 09:43)
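As a rough illustration of the "plain Python" route mentioned in the comment, the sketch below scans the run directories with a regular expression whose optional group plays the role of the "or" between the two suffixes. The function name `find_r1_files`, the one-directory-per-run layout, and the assumption that the shorter suffix is `_R1_001.fastq.gz` (with a leading underscore) are all illustrative and not taken from the question.

```python
import os
import re

# Assumed file names: <samp>_L001_R1_001.fastq.gz or <samp>_R1_001.fastq.gz.
# The optional group (?:_L001)? accepts either form; the lazy .+? keeps the
# lane tag out of the sample name.
R1_PATTERN = re.compile(r"^(?P<samp>.+?)(?:_L001)?_R1_001\.fastq\.gz$")


def find_r1_files(fastq_dir):
    """Return {(run, samp): path} for R1 files found in <fastq_dir>/<run>/."""
    # Python's os functions do not expand "~", so do it explicitly.
    fastq_dir = os.path.expanduser(fastq_dir)
    found = {}
    for run in sorted(os.listdir(fastq_dir)):
        run_dir = os.path.join(fastq_dir, run)
        if not os.path.isdir(run_dir):
            continue
        for name in sorted(os.listdir(run_dir)):
            match = R1_PATTERN.match(name)
            if match:
                found[(run, match.group("samp"))] = os.path.join(run_dir, name)
    return found
```

In a Snakefile, the keys of the returned dictionary could stand in for RUNS and SAMPLES, and an input function could look up the actual path for each `(run, samp)` pair, so the rule no longer depends on a single fixed suffix.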

score 0 · Answer 1 · answered Jan 12 '22 at 14:18

One hack is to create a new wildcard, something like this:

RUNS, SAMPLES, SFX = glob_wildcards("dir/{run}/{samp}_L001{suffix}.fastq.gz")

Depending on the workflow, if SFX is truly not needed, then it can be discarded with:

RUNS, SAMPLES, _ = ...
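One thing to watch for with the extra wildcard: `glob_wildcards` returns one entry per matching file, so a pattern that matches both the R1 and R2 files lists every sample twice. Note also that the pattern above still contains the literal `_L001`, so files without the lane tag would not match it. The sketch below shows one possible way to keep only the R1 entries and deduplicate the (run, samp) pairs; the helper `keep_r1_pairs` and the example lists are hypothetical stand-ins for the values `glob_wildcards` would return.

```python
def keep_r1_pairs(runs, samples, suffixes, wanted="_R1_001"):
    """Return unique (run, samp) pairs whose suffix wildcard equals `wanted`."""
    pairs = []
    for run, samp, sfx in zip(runs, samples, suffixes):
        if sfx == wanted and (run, samp) not in pairs:
            pairs.append((run, samp))
    return pairs


# Hypothetical lists, shaped like the three lists glob_wildcards returns
# when the pattern matches both mates of each sample.
RUNS = ["run1", "run1", "run2", "run2"]
SAMPLES = ["A", "A", "B", "B"]
SFX = ["_R1_001", "_R2_001", "_R1_001", "_R2_001"]

PAIRS = keep_r1_pairs(RUNS, SAMPLES, SFX)
```

Keeping run and sample together as pairs, rather than as two independent lists, avoids accidentally combining a sample with a run it does not belong to when building target file names.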

SultanOrazbayev
