Snakemake tabular configuration, expand, and merge - how to expand input files correctly?

Question

I would greatly appreciate a little pointer for the following. I have a TSV samples table:

Sample  Unit    Tumor_or_Normal Fastq1  Fastq2
A       1       T       reads/a.t.1.fastq       reads/a.t.2.fastq
A       2       N       reads/a.n.1.fastq       reads/a.n.2.fastq
B       1       T       reads/b.t1.1.fastq      reads/b.t1.2.fastq
...

and is read in

samples = pd.read_table(config["samples"], dtype=str).set_index(["Sample", "Unit", "Tumor_or_Normal"], drop=False)
samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels])

I would like to merge all bam files that have the same Sample and Tumor_or_Normal. For example, C-1-T.bam and C-2-T.bam and C-3-T.bam should be merged into C-T.bam. I have a rule

rule merge_recal_by_unit:
    input:
        expand("recal/{{Sample}}-{Unit}-{{Tumor_or_Normal}}.bam",
                Unit=samples.loc[samples.Sample].Unit)
    output:
        bam=protected("merged/{Sample}-{Tumor_or_Normal}.bam")
    params:
        ""
    threads:
        8
    wrapper:
        "0.39.0/bio/samtools/merge"

but this gave an InputFunctionException. I've also tried replacing the expand with

lamblda wildcards: expand("recal/{{Sample}}-{Unit}-{{Tumor_or_Normal}}.bam",
                           Unit=samples.loc[wildcards.Sample].Unit)

but this gave me a syntax error, and

expand("recal/{{Sample}}-{Unit}-{{Tumor_or_Normal}}.bam",
       Unit=samples.index.get_level_values('Unit').unique().values())

resulted in the message that numpy.ndarray object is not callable. This seems similar to this and this question, but I wasn't able to make it work.

Any help here would be greatly appreciated. Many thanks!

score 0 · Accepted Answer · answered Oct 10 '19 at 08:15

It seems you want to query the samples table to get all the rows sharing the same Sample and Tumor_or_Normal and use the list of Unit to construct the input list of bam files. If so, something like this should do:

rule merge_recal_by_unit:
    input:
        lambda wc: ['recal/{Sample}-%s-{Tumor_or_Normal}.bam' % x for x in 
            samples[(samples.Sample == wc.Sample) & (samples.Tumor_or_Normal == wc.Tumor_or_Normal)].Unit]
    output:
        bam=protected("merged/{Sample}-{Tumor_or_Normal}.bam")
    ...

Snakemake tabular configuration, expand, and merge - how to expand input files correctly?

1 Answers1