4

I am working on writing a script which is a combination of snakemake and python code to automate a large number of files that comes in pair. More precisely, I am working on aligning reads with BWA MEM with paired end reads (http://bio-bwa.sourceforge.net/bwa.shtml). On the first part of the script, I iterated over the list of names in my file (which are fastq bunzipped files) then sorted them accordingly in a list. Here's a quick look of some files:

['NG-8653_1A_lib95899_4332_7_1', 'NG-8653_1A_lib95899_4332_7_2', 'NG-8653_1B_lib95900_4332_7_1', 'NG-8653_1B_lib95900_4332_7_2', 'NG-8653_1N_lib95898_4332_7_1', 'NG-8653_1N_lib95898_4332_7_2']

As you can see, the reads are sorted two by two (1A_...1 and 1A..._2, etc...). Now using subprocess, I want to align them by decompressing them with bunzip2 and then passing them through stdin to bwa mem. The bwa mem command transforms fastq format files to .sam files, I have then to use samtools to convert them with .bam format. Here's the script so far:

import re, os, subprocess, bz2

WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "/home/alaa/Documents/snakemake/fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"

FILE_FASTQ = glob_wildcards("fastq/{samples}.fastq.bz2")
LIST_FILE_SAMPLES = []

for x in FILE_FASTQ[0]:
    LIST_FILE_SAMPLES.append(x)

LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)

rule fastq_to_bam:
    run:
        for x in range(0, len(LIST_FILE_SAMPLES), 2):
            # get the name of the sample (1A, 1B ...)
            samp = ""
            samp += LIST_FILE_SAMPLES[x].split("_")[1]

            # get the corresponding read (1 or 2)
            r1 = SAMPLESDIR + LIST_FILE_SAMPLES[x] + ".fastq.bz2"
            r2 = SAMPLESDIR + LIST_FILE_SAMPLES[x+1] + ".fastq.bz2"

            # gunzipping the files and pipping them
            p1 = subprocess.Popen(['bunzip2', '-kc', r1], stdout=subprocess.PIPE)
            p2 = subprocess.Popen(['bunzip2', '-kc', r2], stdout=subprocess.PIPE)           


            # now write the output file to .bam format after aligning them
            with open("sam/" + samp + ".bam", "w") as stdout:
                fastq2sam = subprocess.Popen(["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout], stdout=subprocess.PIPE)
                fastq2samOutput = subprocess.Popen(["samtools", "view", "-Sb", "-"], shell = True, stdin=fastq2sam.stdout, stdout=stdout)

I was trying to debug the script by trying line by line. When writting bunzip2 to an output file, it was working fine. Now if I try to pipe it, I get an error:

Error in job fastq_to_bam while creating output file .
RuleException:
TypeError in line 39 of /home/alaa/Documents/snakemake/Snakefile:
Can't convert '_io.BufferedReader' object to str implicitly
  File "/home/alaa/Documents/snakemake/Snakefile", line 39, in __rule_fastq_to_bam
  File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
  File "/usr/lib/python3.5/subprocess.py", line 1490, in _execute_child
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
 Exiting because a job execution failed. Look above for error message
 Will exit after finishing currently running jobs.
 Exiting because a job execution failed. Look above for error message

Can you please tell me what is the problem with the script ? I am trying to look for the problem since this morning and I can't seem to figure it out. Any help is much appreciated. Thanks in advance.

EDIT 1:

After reading more about the feedback from @bli and @Johannes, I have made it this far:

import re, os, subprocess, bz2, multiprocessing
from os.path import join
from contextlib import closing

WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"


FILE_FASTQ = glob_wildcards("fastq/{samples, NG-8653_\d+[a-zA-Z]+_.+}")
LIST_FILE_SAMPLES = []

for x in FILE_FASTQ[0]:
    LIST_FILE_SAMPLES.append("_".join(x.split("_")[0:5]))

LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)


rule final:
    input:
        expand('bam/' + '{sample}.bam', sample = LIST_FILE_SAMPLES)

rule bunzip_fastq:
    input:
        r1 = SAMPLESDIR + '{sample}_1.fastq.bz2',
        r2 = SAMPLESDIR + '{sample}_2.fastq.bz2'
    output:
        o1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
        o2 = SAMPLESDIR + '{sample}_r2.fastq.gz'
    shell:
        """
        bunzip2 -kc < {input.r1} | gzip -c > {output.o1}
        bunzip2 -kc < {input.r2} | gzip -c > {output.o2}
        """

rule fastq_to_bam:
    input:
        r1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
        r2 = SAMPLESDIR + '{sample}_r2.fastq.gz',
        ref = REF
    output:
        'bam/' + '{sample}.bam'
    shell:
        """
        bwa mem {input.ref} {input.r1} {input.r2} | samtools -b > {output}
        """

Thank a lot for your help ! I think I can manage from here on.

Best regards, Alaa

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Zen
  • 41
  • 1
  • 5
  • I second Johannes Köster's comment in his answer. You might consider having a separate rule for the bunzipping, in which you could use the "shell" section instead of having to run things manually with `subprocess`. Then you make the output of this rule the input of your mapping rule (and remove the loop and use wildcards instead). – bli Apr 11 '17 at 12:16

1 Answers1

5

Your problem is here:

["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout]

p1.stdout and p2.stdout are of type BufferedReader, but subprocess.Popen expects a list of strings. What you might want to use is e.g. p1.stdout.read().

However, please be aware that your approach is not the idiomatic way to use Snakemake, in fact, there is currently nothing in the script that really makes use of Snakemake's features.

With Snakemake, you would rather have a rule that processes a single sample with bwa mem, taking fastq as input and storing bam as output. See this example in the official Snakemake tutorial. It does exactly what you are trying to accomplish here, but with much less necessary boilerplate. Simply let Snakemake do the job, don't try to reimplement this yourself.

Johannes Köster
  • 1,809
  • 6
  • 8
  • Thank you for your reply Johannes. Your comment is helpful. And what I've posted was only the beginning of the script. I fully understand how Snakemake works because I have further rules downstream ! Again, thanks for your help. I am still working on the script and I will post my workaround when I am done with it. – Zen Apr 11 '17 at 11:31