How do I only select values from one row in tsv file?

Question

Edit: I should add that this question is basically exactly what I'm asking about: How to use list in Snakemake Tabular configuration, for describing of sequencing units for bioinformatic pipeline
The issue is that when I try to adapt it to my Snakefile I get an error message saying 'function' object has no attribute 'unique' referring to the following line: samples= list(units_table.SampleSM.unique())
And in the same thread Johannes links to his gatk pipeline that also uses tsv files for this purpose, but I frankly don't understand what's going on so unfortunately it doesn't help. If I could get some input on my Snakefile below I'd be grateful.

I need to start my workflow by populating variables based on a tsv file like this:

flowcell    sample  library lane    R1  R2
FlowCellX   SAMPLE1 libZ    L001    fastq/FCX/Sample1_L001.R1.fastq.gz  fastq/FCX/Sample1_L001.R2.fastq.gz
FlowCellX   SAMPLE1 libZ    L002    fastq/FCX/Sample1_L002.R1.fastq.gz  fastq/FCX/Sample1_L002.R2.fastq.gz
FlowCellX   SAMPLE2 libZ    L001    fastq/FCX/Sample2_L001.R1.fastq.gz  fastq/FCX/Sample2_L001.R2.fastq.gz
FlowCellX   SAMPLE2 libZ    L002    fastq/FCX/Sample2_L002.R1.fastq.gz  fastq/FCX/Sample2_L002.R2.fastq.gz
FlowCellY   SAMPLE1 libX    L001    fastq/FCY/Sample1_L001.R1.fastq.gz  fastq/FCY/Sample1_L001.R2.fastq.gz

My problem is that I don't know how to make it only select values from one row per pair of files, the reason is that I need to construct a read group by selecting values from all columns except for R1 and R2.

The only thing I can make it do currently is to combine every entry from every row and column in flowcell, sample, library and and lane, but I just want it to select the entries on one row per pair of input files.

Here's my code so far:

import pandas as pd
from snakemake.utils import validate, min_version

samples = pd.read_table("samples.tsv", sep='\t', dtype=str).set_index(["flowcell", "sample"], drop=False)

samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels])  # enforce str in index


def get_fastq1(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample), ["R1"]].dropna()
def get_fastq2(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample), ["R2"]].dropna()

rule all:
    input:
        expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", sample=samples['sample'], lane=samples['lane'], flowcell=samples['flowcell']),

rule BwaMem:
    input:
        fasta = "/references/Homo_sapiens_assembly38.fasta",
        fastq1 = get_fastq1,
        fastq2 = get_fastq2,
    params:
        rgs = repr('@RG:\tID:{flowcell}.{lane}\tSM:{sample}\tLB:PlaceHolder\tPU:{flowcell}.{lane}.{sample}\tPL:Illumina')
    output:
        "Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam",
    shell:
        "bwa mem -M -t 12 {input.fasta} \
        -R {params.rgs} \
        {input.fastq1} \
        {input.fastq2} | \
        samtools view -Sb - > {output}"

This works if I only have one row in the tsv file, but when I add a second row so I have a tsv file like this:

flowcell    sample  library lane    R1  R2
FlowCellX   SAMPLE1 libZ    L001    fastq/Sample1_L001.R1.fastq.gz  fastq/Sample1_L001.R2.fastq.gz
FlowCellY   SAMPLE2 libZ    L002    fastq/Sample2_L002.R1.fastq.gz  fastq/Sample2_L00.R2.fastq.gz

It gives me this error message:

InputFunctionException in line 18 of /home/oskar/01-workspace/00-temp/GVP/Snakefile:
KeyError: 9
Wildcards:
sample=SAMPLE1
lane=L001
flowcell=FlowCellY

It's trying to combine SAMPLE1 and L001 with FlowCellY and that's not what I want, I want it to only select FlowCellX, SAMPLE1 and L001 for the fastq/FCX/Sample1_L001.R1.fastq.gz and fastq/FCX/Sample1_L001.R2.fastq.gz input files. The resulting output file and read group would look like this:
Output file name:

SAMPLE1_FlowCellX_L001.mapped.bam

Read group:

@RG: ID:FlowCellX.L001 SM:SAMPLE1 LB:PlaceHolder PU:FlowCellX.L001.SAMPLE1 PL:Illumina

What am I missing?

And a second issue is that I have no clue how to add a {library} variable to the read group where the ..LB:PlaceHolder\t is. If I try putting this variable in the expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", sample=samples['sample'], lane=samples['lane'], flowcell=samples['flowcell']), line it expects {library} to be in the file name and that's not what I want. I guess this issue could be solved along with the solution to the previous answer?

score 0 · Accepted Answer · answered Feb 09 '19 at 20:19

One issue is that your expand function in rule all is going to create all the combinations of sample, lane and flowcell which results in invalid queries to the sample sheet (and probably it's not what you want). If you put expand(...) in a variable and print it you would see the resulting list. The fix below uses zip function to join sample, lane and flowcell in a parallel way.

Abut the {library} question, you could have a get_RG function similar to the get_fastq functions to return the appropropriate value from the sample sheet given the wildcards. See below, there may be better solutions but this hopefully puts you in the right direction...

import pandas as pd

samples = pd.read_csv("samples.tsv", sep='\t', dtype=str).set_index(["flowcell", "sample", "lane"], drop=False)

samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels])  # enforce str in index

def get_fastq1(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample, wildcards.lane), ["R1"]].dropna()

def get_fastq2(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample, wildcards.lane), ["R2"]].dropna()

def get_RG(wc):
    library= samples.loc[(wc.flowcell, wc.sample, wc.lane), ["library"]].unique()[0]
    rgs = r'@RG:\tID:%(flowcell)s.%(lane)s\tSM:%(sample)s\tLB:%(library)s\tPU:%(flowcell)s.%(lane)s.%(sample)s\tPL:Illumina' \
        % {'flowcell': wc.flowcell, 
           'lane': wc.lane, 
           'sample': wc.sample, 
           'library': library}
    return rgs    

rule all:
    input:
        expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", zip,
            sample=samples['sample'], lane=samples['lane'],
            flowcell=samples['flowcell']),

rule BwaMem:
    input:
        fasta = "Homo_sapiens_assembly38.fasta",
        fastq1 = get_fastq1,
        fastq2 = get_fastq2,
    params:
        rgs= get_RG,
    output:
        "Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam",
    shell:
        r"bwa mem -M -t 12 {input.fasta} \
            -R '{params.rgs}' \
            {input.fastq1} \
            {input.fastq2} \
        | samtools view -Sb - > {output}"

How do I only select values from one row in tsv file?

1 Answers1