Combine multiple wildcards in Snakemake

Question

├── DIR1
│   ├── smp1.fastq.gz
│   ├── smp1_fastqc/
│   ├── smp2.fastq.gz
│   └── smp2_fastqc/
└── DIR2
    ├── smp3.fastq.gz
    ├── smp3_fastqc/
    ├── smp4.fastq.gz
    └── smp4_fastqc/

I would like to count the number of reads by sample and then concatenate all counts by directory.
I create a dictionnary to link sample 1 and 2 to directory 1, and sample 3 et 4 to directory 2

DIRS,SAMPLES = glob_wildcards(INDIR+'/{dir}/{smp}.fastq.gz')

# Create samples missing
def filter_combinator(combinator, authlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            if frozenset(wc_comb) in authlist:
                yield wc_comb
    return filtered_combinator

# Authentification
combine_dir_samples = []

for dir in DIRS:
    samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    for smp in samples:
        combine_dir_samples.append( { "dir" : dir, "smp" : smp} )
       
combine_dir_samples = { frozenset( x.items() ) for x in combine_dir_samples }
dir_samples = filter_combinator(product, combine_dir_samples)

Then, I create a rule to count my reads by sample

rule all:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)

rule countReads:
    input:
        INDIR+'/{dir}/{smp}_fastqc/fastqc_data.txt'
    output:
        INDIR+'/{dir}/{smp}_Nreads.txt'
    shell:
        "grep 'Total\ Sequences' {input} | awk '{{print {wildcards.dir},$3}}' > {output}"

---------------------------------------------------------------
# result ok

├── DIR1
│   ├── smp1_Nreads.txt
│   └── smp2_Nreads.txt
└── DIR2
    ├── smp3_Nreads.txt
    └── smp4_Nreads.txt

> cat smp1_Nreads.txt
DIR1 15082186

But then, I would like to add a rule to concatenate my smp_Nreads.txt files by directory

rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
    output:
        INDIR+'/{dir}/Nreads_{dir}.txt'
    shell:
        "cat {input} > {output}"
------------------------------------------------------------------
# result

├── DIR1
│   └── Nreads_DIR1.txt
└── DIR2
    └── Nreads_DIR2.txt

# but both files are identical
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
DIR2 11635831
DIR2 45924459

# I would like to have
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
> cat Nreads_DIR2.txt
DIR2 11635831
DIR2 45924459

I tried with different input syntax for my concat rule

expand(OUTFastq+'/{dir}/FastQC/{{smp}}_Nreads.txt', dir_samples, dir=DIRS)
lambda wildcards: expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)

I don't find any solution, it is like it doesn't care of my dictionnary for this rule.

EDIT

I tried to use a dictionnary instead of my combination filter_combinator and to use a function as input of my rule to get samples.

dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}

def func(dir):
    return dir_to_samples[dir]

rule all:
    input:
        lambda wildcards: expand(OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip', dir=wildcards.dir, smp=func(wildcards.dir))

rule fastQC:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}.fastq.gz', dir=wildcards.dir, smp=func(wildcards.dir))
    output:
        OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip'
    shell:
        "fastqc {input} -o {OUTDIR}/{wildcards.dir}/FastQC/" 

> AttributeError: 'Wildcards' object has no attribute 'dir'

score 1 · Accepted Answer · answered Sep 15 '20 at 05:58

1

First of all, I think that you have overcomplicated the solution making it less idiomatic for Snakemake. As the result you are experiencing the problems implementing the rule. Anyway, let me answer just the question in the form you've asked it.

There should be no surprise that both of your Nreads_DIRx.txt files are identical, as the input doesn't depend on any wildcard in the output:

rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)

Here the expand function resolves both dir and smp variables producing the list of fully specified filenames. What you need is something that really depends on the wildcards in your output:

rule concatNreads:
    input:
        lambda wildcards: ...

The {dir} is fully specified with the wildcard from the output, so you don't need to assign the values to it from the DIRS variable:

rule concatNreads:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir=wildcards.dir, smp=func(wildcards.dir))

Now the question is how to implement this func function that produces the list of samples for a directory. It took me a while to understand your tricks with combine_dir_samples and filter_combinator, so I'm leaving that to you to implement the func function using that code. But what you really need is a map from DIR -> SAMPLES:

dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}

def func(dir):
    return dir_to_samples[dir]

This dir_to_samples could probably be evaluated even easier, but here is your modified solution:

for dir in DIRS:
    samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    dir_to_samples.append({dir: samples})

answered Sep 15 '20 at 05:58

Dmitry Kuzminov

6,180
6
18
40

The tricks with `combine_dir_samples` come from here : https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar I don't want to use all the directories in the wildcard `{dir}`, that's why I prefer to expand my {dir} wildcard to a list I chose myself. I tried your way with the `dir_to_samples` returned in a function. But : `Error: list indices must be integers or slices, not str` so I can't return it directly. – Elysire Sep 15 '20 at 08:07
About your function : I tried to loop through the list and test if an indice is equal to dir : `if dir_to_samples[i] == dir: return dir_to_samples[i]`, doesn't work either because it is not really the indice I should test but the key of it. I don't know how. About my `combine_dir_sample` if you look my `rule all` I expand the output of my first rule `count reads` with this combination and it works fine, I don't have the same files in my output. I would like to do the same as an input. But if you can find the right way to use your function and it's easier, it's even better. – Elysire Sep 15 '20 at 08:13
And just for you to visualize my `combine_dir_sample` look like this : `{frozenset({('dir', 'DIR1'), ('smp', 'smp1')}), frozenset({('dir', 'DIR1'), ('smp', 'smp2')}), frozenset({('dir', 'DIR2'), ('smp', 'smp3')}), frozenset({('dir', 'DIR2'), ('smp', 'smp4')})}` – Elysire Sep 15 '20 at 08:19
@Elysire, `dir_to_samples` has to be a `dict` (see my example `dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}`). – Dmitry Kuzminov Sep 15 '20 at 08:30
You are experiencing an XY Problem. The solution you are trying to reuse is based on the question *How to use expand in snakemake when some particular combinations of wildcards are not desired?*, but using the `expand` function is not the only way to go. – Dmitry Kuzminov Sep 15 '20 at 08:34
I tried to redo everything with a dictionary instead of my combination as you suggested, I can't even get my first rule to work. I feel like I don't understand anything in Snakemake anymore. I edited my question with this new approach. – Elysire Oct 02 '20 at 13:32
@Elysire, the target rule (`rule all`) shall have no wildcards in the `output` section. You need to specify the target explicitly. As you said: "and then concatenate all counts by directory", your target should be something like `output: expand("{dir}/counts_concatenated", dir=dirs)`. Try to forget of all the contents of the files and just express your idea in terms of files and their dependencies. – Dmitry Kuzminov Oct 02 '20 at 14:40
Thank you, it's working when my target rule is a concatenation because I need only one wildcard. But, just to know : how can I write my target rule if my output is not a concatenation ? Exactly like I tried at the end of my post in my EDIT, I need both `smp` and `dir` to expand in my target rule. – Elysire Oct 05 '20 at 08:17
@Elysire, That is up to you how to describe the logic. You may use `glob_wildcards` to collect some information about actual files on your disk and get some initial wildcards values, but the rest is on you. Better try to reverse the logic: forget of what has to be done from the point of the contents of the files, and think only in terms of the files, their filenames and dependencies. – Dmitry Kuzminov Oct 05 '20 at 08:24

Combine multiple wildcards in Snakemake

1 Answers1