Edit:
I should add that this question is basically exactly what I'm asking about: How to use list in Snakemake Tabular configuration, for describing of sequencing units for bioinformatic pipeline
The issue is that when I try to adapt it to my Snakefile I get an error message saying 'function' object has no attribute 'unique'
referring to the following line: samples= list(units_table.SampleSM.unique())
And in the same thread Johannes links to his gatk pipeline that also uses tsv files for this purpose, but I frankly don't understand what's going on so unfortunately it doesn't help. If I could get some input on my Snakefile below I'd be grateful.
I need to start my workflow by populating variables based on a tsv file like this:
flowcell sample library lane R1 R2
FlowCellX SAMPLE1 libZ L001 fastq/FCX/Sample1_L001.R1.fastq.gz fastq/FCX/Sample1_L001.R2.fastq.gz
FlowCellX SAMPLE1 libZ L002 fastq/FCX/Sample1_L002.R1.fastq.gz fastq/FCX/Sample1_L002.R2.fastq.gz
FlowCellX SAMPLE2 libZ L001 fastq/FCX/Sample2_L001.R1.fastq.gz fastq/FCX/Sample2_L001.R2.fastq.gz
FlowCellX SAMPLE2 libZ L002 fastq/FCX/Sample2_L002.R1.fastq.gz fastq/FCX/Sample2_L002.R2.fastq.gz
FlowCellY SAMPLE1 libX L001 fastq/FCY/Sample1_L001.R1.fastq.gz fastq/FCY/Sample1_L001.R2.fastq.gz
My problem is that I don't know how to make it only select values from one row per pair of files, the reason is that I need to construct a read group by selecting values from all columns except for R1 and R2.
The only thing I can make it do currently is to combine every entry from every row and column in flowcell, sample, library and and lane, but I just want it to select the entries on one row per pair of input files.
Here's my code so far:
import pandas as pd
from snakemake.utils import validate, min_version
samples = pd.read_table("samples.tsv", sep='\t', dtype=str).set_index(["flowcell", "sample"], drop=False)
samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels]) # enforce str in index
def get_fastq1(wildcards):
return samples.loc[(wildcards.flowcell, wildcards.sample), ["R1"]].dropna()
def get_fastq2(wildcards):
return samples.loc[(wildcards.flowcell, wildcards.sample), ["R2"]].dropna()
rule all:
input:
expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", sample=samples['sample'], lane=samples['lane'], flowcell=samples['flowcell']),
rule BwaMem:
input:
fasta = "/references/Homo_sapiens_assembly38.fasta",
fastq1 = get_fastq1,
fastq2 = get_fastq2,
params:
rgs = repr('@RG:\tID:{flowcell}.{lane}\tSM:{sample}\tLB:PlaceHolder\tPU:{flowcell}.{lane}.{sample}\tPL:Illumina')
output:
"Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam",
shell:
"bwa mem -M -t 12 {input.fasta} \
-R {params.rgs} \
{input.fastq1} \
{input.fastq2} | \
samtools view -Sb - > {output}"
This works if I only have one row in the tsv file, but when I add a second row so I have a tsv file like this:
flowcell sample library lane R1 R2
FlowCellX SAMPLE1 libZ L001 fastq/Sample1_L001.R1.fastq.gz fastq/Sample1_L001.R2.fastq.gz
FlowCellY SAMPLE2 libZ L002 fastq/Sample2_L002.R1.fastq.gz fastq/Sample2_L00.R2.fastq.gz
It gives me this error message:
InputFunctionException in line 18 of /home/oskar/01-workspace/00-temp/GVP/Snakefile:
KeyError: 9
Wildcards:
sample=SAMPLE1
lane=L001
flowcell=FlowCellY
It's trying to combine SAMPLE1 and L001 with FlowCellY and that's not what I want, I want it to only select FlowCellX
, SAMPLE1
and L001
for the fastq/FCX/Sample1_L001.R1.fastq.gz
and fastq/FCX/Sample1_L001.R2.fastq.gz
input files. The resulting output file and read group would look like this:
Output file name:
SAMPLE1_FlowCellX_L001.mapped.bam
Read group:
@RG: ID:FlowCellX.L001 SM:SAMPLE1 LB:PlaceHolder PU:FlowCellX.L001.SAMPLE1 PL:Illumina
What am I missing?
And a second issue is that I have no clue how to add a {library}
variable to the read group where the ..LB:PlaceHolder\t
is. If I try putting this variable in the expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", sample=samples['sample'], lane=samples['lane'], flowcell=samples['flowcell']),
line it expects {library}
to be in the file name and that's not what I want. I guess this issue could be solved along with the solution to the previous answer?