15

I have a pretty simple problem. I have a large file that goes through three steps: a decoding step using an external program, some processing in Python, and then recoding using another external program. I have been using subprocess.Popen() to try to do this in Python rather than forming Unix pipes. However, all the data are buffered in memory. Is there a Pythonic way of doing this task, or am I best dropping back to a simple Python script that reads from stdin and writes to stdout with Unix pipes on either side?

import os, sys, subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout.read():
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
seandavi
    Good question. Larger than available RAM. – seandavi Oct 21 '10 at 19:30
  • Stupid error on my part. I used the read() method in the for loop above. The corrected line should, of course, not have the .read() since samtools.stdout is actually a file-like object. – seandavi Oct 21 '10 at 19:50
  • :) The read() call reads the entire file into memory; the other method, which I provided in the answer below, treats it like a generator, I believe. – anijhaw Oct 21 '10 at 19:54

5 Answers

13

Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the files in memory at all, you can pass file objects as the stdin and stdout parameters. From the subprocess docs:

bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).
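
For example, here is a minimal sketch of the "pass file objects" idea; the file names are placeholders and the samtools invocation is taken from the question:

import subprocess

# Hedged sketch: because stdout is a real file object, samtools writes the
# decoded data directly to disk; Python never holds it in memory.
with open("decoded.sam", "wb") as out:
    p = subprocess.Popen(["samtools", "view", "input.bam"], stdout=out)
    p.wait()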

user470379
    Straight from the same docs, under the `communicate` method: "Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited." – André Caron Oct 21 '10 at 19:41
  • I posted the code above. This code definitely leads to the python process heading toward the stratosphere in terms of memory usage, so I am definitely missing some detail.... – seandavi Oct 21 '10 at 19:46
    From docs: Changed in version 3.3.1: bufsize now defaults to -1 to enable buffering by default to match the behavior that most code expects. – cmcginty Feb 23 '18 at 22:40
  • In Python 2.x the default for bufsize is "0". – Ikem Krueger Mar 22 '18 at 14:54
  • Actually, I am doing this for a REST API in Flask, and the subprocess.Popen() process gets killed by the OOM killer because the subprocess generates more data than the available RAM. – Quacky dev Nov 04 '21 at 11:29
  • Changed in version 3.3.1: bufsize now defaults to -1 to enable buffering by default to match the behavior that most code expects. In versions prior to Python 3.2.4 and 3.3.1 it incorrectly defaulted to 0 which was unbuffered and allowed short reads. This was unintentional and did not match the behavior of Python 2 as most code expected. – Chris Kitching Feb 23 '22 at 21:18
6

Try making this small change and see if the efficiency is better. Iterating over samtoolsin.stdout pulls one line at a time from the pipe instead of slurping the whole stream into memory first.

    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
anijhaw
4

However, all the data are buffered in memory ...

Are you using subprocess.Popen.communicate()? By design, this function will wait for the process to finish, all the while accumulating the data in a buffer, and then return it to you. As you've pointed out, this is problematic if dealing with very large files.

If you want to process the data while it is being generated, you will need to write a loop using the poll() and .stdout.read() methods, then write that output to another socket/file/etc.

Do be sure to note the warnings in the documentation against doing this, as it is easy to end up in a deadlock (the parent process waits for the child process to generate data, while the child in turn waits for the parent process to empty the pipe buffer).
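
A minimal sketch of that kind of loop, assuming the output is simply being copied to a file; the command, chunk size and file names are placeholders:

import subprocess

# Hedged sketch: stream the child's stdout to a file in fixed-size chunks so
# that only one chunk is ever held in memory at a time.
p = subprocess.Popen(["samtools", "view", "input.bam"], stdout=subprocess.PIPE)
with open("output.sam", "wb") as out:
    while True:
        chunk = p.stdout.read(65536)   # read at most 64 KiB per iteration
        if not chunk:                  # empty read means the child closed stdout
            break
        out.write(chunk)
p.wait()                               # collect the child's exit status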

André Caron
1

I was using the .read() method on the stdout stream. Instead, I simply needed to iterate over the stream directly in the for loop above. The corrected code does what I expected.

#!/usr/bin/env python
import os
import sys
import subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
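
One detail the snippet above leaves out (an assumption on my part, not part of the original answer): after the loop, the downstream pipe should be closed and both processes waited on, so that samtools import sees end-of-file and finishes writing before main() returns. The end of main() could look something like this:

    # Hedged addition: close the pipe so "samtools import" sees EOF, then wait
    # for both children to exit.
    samtoolsout.stdin.close()
    samtoolsin.stdout.close()
    samtoolsout.wait()
    samtoolsin.wait()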
seandavi
-1

Trying to do some basic shell piping with very large input in Python:

svnadmin load /var/repo < r0-100.dump

I found the simplest way to get this working even with large (2-5GB) files was:

subprocess.check_output('svnadmin load %s < %s' % (repo, fname), shell=True)

I like this method because it's simple and you can do standard shell redirection.

I tried going the Popen route to run a redirect:

cmd = 'svnadmin load %s' % repo
p = Popen(cmd, stdin=PIPE, stdout=PIPE, shell=True)
with open(fname) as inline:
    for line in inline:
        p.communicate(input=line)

But that broke with large files. Using:

p.stdin.write() 

also broke with very large files.

mauricio777
    1- It is incorrect to call `p.communicate()` with the input more than once (the child process is dead if `p.communicate()` returns). 2- no need to use the shell: `check_call(['svnadmin', 'load', repo], stdin=input_file)` should work however large `input_file` is. – jfs May 05 '16 at 20:20
  • @J.F. Sebastian: Thanks for the info. – mauricio777 May 06 '16 at 01:19
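
For reference, a minimal sketch of the shell-free variant suggested in the comment above, using the repository path and dump file name from the answer as placeholders:

import subprocess

# Hedged sketch: pass an open file object as stdin so the dump is streamed to
# svnadmin by the OS and never loaded into Python's memory.
repo = "/var/repo"
with open("r0-100.dump", "rb") as dump:
    subprocess.check_call(["svnadmin", "load", repo], stdin=dump)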