15

I have a pretty simple problem. I have a large file that goes through three steps: a decoding step using an external program, some processing in Python, and then recoding using another external program. I have been using subprocess.Popen() to try to do this in Python rather than forming Unix pipes. However, all the data are buffered in memory. Is there a Pythonic way of doing this task, or am I best dropping back to a simple Python script that reads from stdin and writes to stdout with Unix pipes on either side?

import os, sys, subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout.read():
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
seandavi
    Good question. Larger than available RAM. – seandavi Oct 21 '10 at 19:30
  • Stupid error on my part. I used the read() method in the for loop above. The corrected line should, of course, not have the .read() since samtools.stdout is actually a file-like object. – seandavi Oct 21 '10 at 19:50
  • :) The read() call reads the entire file into memory; the other method, which I provided in the answer below, treats it like a generator, I believe. – anijhaw Oct 21 '10 at 19:54

5 Answers

13

Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the files in memory at all, you can pass file objects as the stdin and stdout parameters. From the subprocess docs:

bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).
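
For example, here is a minimal sketch of the "pass file objects" idea; the file names are placeholders and the samtools invocation is taken from the question:

import subprocess

# Hedged sketch: because stdout is a real file object, samtools writes the
# decoded data directly to disk; Python never holds it in memory.
with open("decoded.sam", "wb") as out:
    p = subprocess.Popen(["samtools", "view", "input.bam"], stdout=out)
    p.wait()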

user470379
    Straight from the same docs, under the `communicate` method: "Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited." – André Caron Oct 21 '10 at 19:41
  • I posted the code above. This code definitely leads to the python process heading toward the stratosphere in terms of memory usage, so I am definitely missing some detail.... – seandavi Oct 21 '10 at 19:46
    From docs: Changed in version 3.3.1: bufsize now defaults to -1 to enable buffering by default to match the behavior that most code expects. – cmcginty Feb 23 '18 at 22:40
  • In Python 2.x the default for bufsize is "0". – Ikem Krueger Mar 22 '18 at 14:54
  • Actually, I am doing this for a REST API in Flask, and the subprocess.Popen() process gets killed by the OOM killer because the subprocess generates more data than the available RAM. – Quacky dev Nov 04 '21 at 11:29
  • Changed in version 3.3.1: bufsize now defaults to -1 to enable buffering by default to match the behavior that most code expects. In versions prior to Python 3.2.4 and 3.3.1 it incorrectly defaulted to 0 which was unbuffered and allowed short reads. This was unintentional and did not match the behavior of Python 2 as most code expected. – Chris Kitching Feb 23 '22 at 21:18
6

Try making this small change and see if the efficiency is better. Iterating over samtoolsin.stdout pulls one line at a time from the pipe instead of slurping the whole stream into memory first.

    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
anijhaw
4

However, all the data are buffered in memory ...

Are you using subprocess.Popen.communicate()? By design, this function will wait for the process to finish, all the while accumulating the data in a buffer, and then return it to you. As you've pointed out, this is problematic if dealing with very large files.

If you want to process the data while it is being generated, you will need to write a loop using the poll() and .stdout.read() methods, then write that output to another socket/file/etc.

Do be sure to note the warnings in the documentation against doing this, as it is easy to end up in a deadlock (the parent process waits for the child process to generate data, while the child in turn waits for the parent process to empty the pipe buffer).
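
A minimal sketch of that kind of loop, assuming the output is simply being copied to a file; the command, chunk size and file names are placeholders:

import subprocess

# Hedged sketch: stream the child's stdout to a file in fixed-size chunks so
# that only one chunk is ever held in memory at a time.
p = subprocess.Popen(["samtools", "view", "input.bam"], stdout=subprocess.PIPE)
with open("output.sam", "wb") as out:
    while True:
        chunk = p.stdout.read(65536)   # read at most 64 KiB per iteration
        if not chunk:                  # empty read means the child closed stdout
            break
        out.write(chunk)
p.wait()                               # collect the child's exit status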

André Caron
1

I was using the .read() method on the stdout stream. Instead, I simply needed to iterate over the stream directly in the for loop above. The corrected code does what I expected.

#!/usr/bin/env python
import os
import sys
import subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
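
One detail the snippet above leaves out (an assumption on my part, not part of the original answer): after the loop, the downstream pipe should be closed and both processes waited on, so that samtools import sees end-of-file and finishes writing before main() returns. The end of main() could look something like this:

    # Hedged addition: close the pipe so "samtools import" sees EOF, then wait
    # for both children to exit.
    samtoolsout.stdin.close()
    samtoolsin.stdout.close()
    samtoolsout.wait()
    samtoolsin.wait()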
seandavi
-1

Trying to do some basic shell piping with very large input in Python:

svnadmin load /var/repo < r0-100.dump

I found the simplest way to get this working even with large (2-5GB) files was:

subprocess.check_output('svnadmin load %s < %s' % (repo, fname), shell=True)

I like this method because it's simple and you can do standard shell redirection.

I tried going the Popen route to run a redirect:

cmd = 'svnadmin load %s' % repo
p = Popen(cmd, stdin=PIPE, stdout=PIPE, shell=True)
with open(fname) as inline:
    for line in inline:
        p.communicate(input=line)

But that broke with large files. Using:

p.stdin.write() 

also broke with very large files.

mauricio777
    1- It is incorrect to call `p.communicate()` with the input more than once (the child process is dead if `p.communicate()` returns). 2- no need to use the shell: `check_call(['svnadmin', 'load', repo], stdin=input_file)` should work however large `input_file` is. – jfs May 05 '16 at 20:20
  • @J.F. Sebastian: Thanks for the info. – mauricio777 May 06 '16 at 01:19
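
For reference, a minimal sketch of the shell-free variant suggested in the comment above, using the repository path and dump file name from the answer as placeholders:

import subprocess

# Hedged sketch: pass an open file object as stdin so the dump is streamed to
# svnadmin by the OS and never loaded into Python's memory.
repo = "/var/repo"
with open("r0-100.dump", "rb") as dump:
    subprocess.check_call(["svnadmin", "load", repo], stdin=dump)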