I am working on writing a script which is a combination of snakemake and python code to automate a large number of files that comes in pair. More precisely, I am working on aligning reads with BWA MEM with paired end reads (http://bio-bwa.sourceforge.net/bwa.shtml). On the first part of the script, I iterated over the list of names in my file (which are fastq bunzipped files) then sorted them accordingly in a list. Here's a quick look of some files:
['NG-8653_1A_lib95899_4332_7_1', 'NG-8653_1A_lib95899_4332_7_2', 'NG-8653_1B_lib95900_4332_7_1', 'NG-8653_1B_lib95900_4332_7_2', 'NG-8653_1N_lib95898_4332_7_1', 'NG-8653_1N_lib95898_4332_7_2']
As you can see, the reads are sorted two by two (1A_...1 and 1A..._2, etc...). Now using subprocess, I want to align them by decompressing them with bunzip2 and then passing them through stdin to bwa mem. The bwa mem command transforms fastq format files to .sam files, I have then to use samtools to convert them with .bam format. Here's the script so far:
import re, os, subprocess, bz2
WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "/home/alaa/Documents/snakemake/fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"
FILE_FASTQ = glob_wildcards("fastq/{samples}.fastq.bz2")
LIST_FILE_SAMPLES = []
for x in FILE_FASTQ[0]:
LIST_FILE_SAMPLES.append(x)
LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)
rule fastq_to_bam:
run:
for x in range(0, len(LIST_FILE_SAMPLES), 2):
# get the name of the sample (1A, 1B ...)
samp = ""
samp += LIST_FILE_SAMPLES[x].split("_")[1]
# get the corresponding read (1 or 2)
r1 = SAMPLESDIR + LIST_FILE_SAMPLES[x] + ".fastq.bz2"
r2 = SAMPLESDIR + LIST_FILE_SAMPLES[x+1] + ".fastq.bz2"
# gunzipping the files and pipping them
p1 = subprocess.Popen(['bunzip2', '-kc', r1], stdout=subprocess.PIPE)
p2 = subprocess.Popen(['bunzip2', '-kc', r2], stdout=subprocess.PIPE)
# now write the output file to .bam format after aligning them
with open("sam/" + samp + ".bam", "w") as stdout:
fastq2sam = subprocess.Popen(["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout], stdout=subprocess.PIPE)
fastq2samOutput = subprocess.Popen(["samtools", "view", "-Sb", "-"], shell = True, stdin=fastq2sam.stdout, stdout=stdout)
I was trying to debug the script by trying line by line. When writting bunzip2 to an output file, it was working fine. Now if I try to pipe it, I get an error:
Error in job fastq_to_bam while creating output file .
RuleException:
TypeError in line 39 of /home/alaa/Documents/snakemake/Snakefile:
Can't convert '_io.BufferedReader' object to str implicitly
File "/home/alaa/Documents/snakemake/Snakefile", line 39, in __rule_fastq_to_bam
File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
File "/usr/lib/python3.5/subprocess.py", line 1490, in _execute_child
File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Exiting because a job execution failed. Look above for error message
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Can you please tell me what is the problem with the script ? I am trying to look for the problem since this morning and I can't seem to figure it out. Any help is much appreciated. Thanks in advance.
EDIT 1:
After reading more about the feedback from @bli and @Johannes, I have made it this far:
import re, os, subprocess, bz2, multiprocessing
from os.path import join
from contextlib import closing
WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"
FILE_FASTQ = glob_wildcards("fastq/{samples, NG-8653_\d+[a-zA-Z]+_.+}")
LIST_FILE_SAMPLES = []
for x in FILE_FASTQ[0]:
LIST_FILE_SAMPLES.append("_".join(x.split("_")[0:5]))
LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)
rule final:
input:
expand('bam/' + '{sample}.bam', sample = LIST_FILE_SAMPLES)
rule bunzip_fastq:
input:
r1 = SAMPLESDIR + '{sample}_1.fastq.bz2',
r2 = SAMPLESDIR + '{sample}_2.fastq.bz2'
output:
o1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
o2 = SAMPLESDIR + '{sample}_r2.fastq.gz'
shell:
"""
bunzip2 -kc < {input.r1} | gzip -c > {output.o1}
bunzip2 -kc < {input.r2} | gzip -c > {output.o2}
"""
rule fastq_to_bam:
input:
r1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
r2 = SAMPLESDIR + '{sample}_r2.fastq.gz',
ref = REF
output:
'bam/' + '{sample}.bam'
shell:
"""
bwa mem {input.ref} {input.r1} {input.r2} | samtools -b > {output}
"""
Thank a lot for your help ! I think I can manage from here on.
Best regards, Alaa