
Background:
Python 2.6.6 on Linux. First part of a DNA sequence analysis pipeline.
I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, I gunzip it to a stream (i.e. using `gunzip -c FILENAME`). If the first character of that stream is "@", I route the entire stream into a filtering program that reads standard input; otherwise I pipe it directly to a file on local disk. I'd like to minimize the number of file reads/seeks from remote storage (a single pass through the file shouldn't be impossible?).

Contents of an example input file, first four lines corresponding to one record in FASTQ format:

@I328_1_FC30MD2AAXX:8:1:1719:1113/1                                        
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
+I328_1_FC30MD2AAXX:8:1:1719:1113/1                                        
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc

Files that should not be piped into the filtering program contain records that look like this (first two lines corresponding to one record in FASTA format):

>I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG

Some made-up semi-pseudocode to visualize what I want to do (I know this isn't possible the way I've written it). I hope it makes some sense:

if gzipped:
    gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
    if gunzip.stdout.peek(1) == "@": # This isn't possible
        fastq = True
    else:
        fastq = False
if fastq:
    filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
else:
    # Send the gunzipped stream to another file

Disregard the fact that the code won't run as I've written it here and that I have no error handling, etc.; all that is already in my other code. I just want help with peeking into the stream, or with finding a way around that. It would be great if you could call `gunzip.stdout.peek(1)`, but I realize that's not possible.

What I've tried so far:
I figured subprocess.Popen might help me achieve this, and I've tried a lot of different ideas, among others trying to use some kind of io.BufferedRandom() object to write the stream to, but I can't figure out how that would work. I know streams are non-seekable, but a workaround might be to read the first character of the gunzip stream and then create a new stream into which you first put a "@" or ">" (depending on file contents) and then stuff the rest of the gunzip.stdout stream. This new stream would then be fed into the filter's Popen stdin.

Note that the file sizes might be several times larger than available memory. I do not want to perform more than one single read of the source file from remote storage and no unnecessary file accessing.
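The workaround sketched above (read one byte, then re-feed that byte plus the rest of the stream) could also be done with a background thread that pumps data between the two pipes, never holding more than one chunk in memory. This is only a sketch: the helper name `route_stream` is made up, and it is written with Python 3 byte strings for clarity (the same shape works on 2.6 with plain strings). The demo uses a small Python subprocess as a stand-in for `gunzip`:

```python
import subprocess
import sys
import threading

def route_stream(source_out, dest_in, chunk_size=64 * 1024):
    """Peek one byte from source_out, then pump that byte plus the
    rest of the stream into dest_in from a background thread."""
    first = source_out.read(1)

    def pump():
        dest_in.write(first)
        while True:
            chunk = source_out.read(chunk_size)
            if not chunk:
                break
            dest_in.write(chunk)
        dest_in.close()  # signal EOF to the downstream process

    t = threading.Thread(target=pump)
    t.start()
    return first, t

# Demo: a stand-in for the gunzip process, piped into "cat" as a
# stand-in for the filter.
src = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write('@hello')"],
    stdout=subprocess.PIPE)
dst = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                       stdout=subprocess.PIPE)
first, t = route_stream(src.stdout, dst.stdin)
out = dst.stdout.read()
t.join()
```

After the peek, `first` tells you whether to route to the filter or to a plain file; the pump thread does the rest without any extra reads from remote storage.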

Any ideas are welcome! Please ask me questions so I can clarify if I didn't make it clear enough.

Leandro Papasidero
    Using the [gzip](http://docs.python.org/library/gzip.html#module-gzip) module instead of the external `gzip` should give you more flexibility. – Pedro Romano Oct 07 '12 at 18:31
  • @PedroRomano Yes it might. I'm concerned about the number of file accesses I make though. This will be part of the data acquisition step of a pipeline implemented on a supercomputer system and will probably run on a multitude of nodes simultaneously and too many file system calls might bog down the remote file server. – user1727089 Oct 08 '12 at 13:01

2 Answers


Here is an implementation of your *first input a "@" or ">" depending on file contents and then stuff the rest of the gunzip.stdout-stream into the new stream* proposal. I have only tested the local-file branch, but it should be enough to demonstrate the concept.

if gzipped:
    source = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
else:
    source = Popen(["cat", "remotestorage/file"], stdout=PIPE)
firstchar = source.stdout.read(1)
# "unread" the char we've just read
source = Popen([r"(printf '\x%02x' && cat)" % ord(firstchar)],
               shell=True, stdin=source.stdout, stdout=PIPE)

# Now feed the output to a filter or to a local file.
flocal = None
try:
    if firstchar == "@":
        filter = Popen(["filter", "localstorage/outputfile.fastq"],
                       stdin=source.stdout)
    else:
        flocal = open('localstorage/outputfile.stream', 'w')
        filter = Popen(["cat"], stdin=source.stdout, stdout=flocal)
    filter.communicate()
finally:
    if flocal is not None:
        flocal.close()

The idea is to read a single character from the source command's output, and then to recreate the original stream using `(printf '\xhh' && cat)`, effectively implementing the peek. The replacement stream passes `shell=True` to Popen, leaving the heavy lifting to the shell and `cat`. The data remains in the pipeline at all times, never getting entirely read into memory. Note that the services of the shell are only requested for the single call to Popen that implements unreading the peeked byte, not for the calls that involve user-supplied file names. Even at that point, the byte is escaped to hex to make sure that the shell does not mangle it when invoking printf.

The code could be further cleaned up to implement an actual function named peek that returns the peeked contents and a replacement new_source.

user4815162342
  • This is a hacky way of doing it I suppose, but it should probably work fine. I'll try some things out and then come back to your post if it turns out to be the best. I'm generally not a big fan of running true shells with user input on the command line (the user can set output files, for example); they always manage to mess things up. Unless, at the risk of taking this comment thread a bit off topic, you have a great way of sanitizing such input? – user1727089 Oct 08 '12 at 12:53
  • Thanks for the compliment. :) Seriously, while the code may look like a hack, this solution works well regarding memory consumption and requires the least amount of changes to existing code, which already invokes external commands for ungzipping and filtering. "Correct" solutions would require a serious rearrangement of your code to use `zlib` and manually feed data to the filter, possibly slowing down the process. It is indicative that those who suggested such solutions did not offer working code. – user4815162342 Oct 08 '12 at 13:01
  • I've now updated the code to avoid `shell=True` with user-supplied input. – user4815162342 Oct 08 '12 at 13:07
  • I chose your answer as my favorite, even if I'll probably not use it as is anyway. It definitely answered the exact thing I wanted to solve and I commend you for that. I'm currently working on a solution which involves implementing a custom peek-method (like yours) that starts a thread that feeds the stdout into the stdin of the next Popen-process. We'll see how that works out and if it doesn't I'll most definitely go back to yours! I liked it! – user1727089 Oct 08 '12 at 14:45
  • That's an interesting approach. The `printf ... && cat` command pretty much already does that, but in a new subprocess. Note that `cat` is likely to be significantly faster at slinging data to a pipe than any loop written in Python, so I expect your solution to be slower on large inputs, but faster on small ones (because it doesn't need to exec `sh` and `cat`). If you are working on Unix, my advice would be to simply `os.fork()` instead of starting a new thread. Less hassle to manage, no worrying about shared data structures, deadlocks, and the GIL. – user4815162342 Oct 08 '12 at 15:28

It doesn't make sense to wrap shell commands in Python. You can achieve everything you need in pure Python, without shelling out:

  1. Open the input file and read the first 3 bytes. If they equal 1F 8B 08, it should be a gzip file.
  2. Reset the file marker.
  3. Pass the file contents to zlib.decompress() if it is a gzip file, or read the file directly otherwise.
  4. Pass to the filter function if required.
  5. Write the results to a file.

EDIT

This won't work, as the gzip headers would need to be stripped before passing the data to zlib. It would be possible, however, to check the first 3 bytes, perform a fh.seek(0), and pass the file to gzip.open() if you wanted to be sure the file was a gzip (with DEFLATE compression).
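That check-then-seek approach might be sketched as follows (the helper name `open_maybe_gzip` is made up). Using `gzip.GzipFile(fileobj=...)` on the already-open handle keeps it to a single open of the remote file:

```python
import gzip

def open_maybe_gzip(path):
    """Open a file, transparently ungzipping it if it starts with
    the gzip magic number. The file is opened only once."""
    fh = open(path, "rb")
    magic = fh.read(2)  # 1F 8B is the gzip magic number; the 3rd
                        # byte (08) is the DEFLATE compression method
    fh.seek(0)          # rewind so the reader sees the whole file
    if magic == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=fh)
    return fh
```

The FASTA/FASTQ decision can then be made by reading the first byte of whatever file object `open_maybe_gzip` returns.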

It may be easier to just pass the file to gzip and catch the exception thrown if the file is not gzipped:

import gzip

try:
    in_file = gzip.open("infile")
    f_contents = in_file.read()
except IOError, e:
    # Re-raise the exception if its message is not "Not a gzipped file"
    # Perhaps it would be safer to check the header!
    if str(e) != "Not a gzipped file":
        raise
    in_file = open("infile")
    f_contents = in_file.read()
in_file.close()

if f_contents[0] == "@":
    result = filter_function(f_contents)
else:
    result = f_contents

new_file = open("new_file", "w")
new_file.write(result)
new_file.close()
Alastair McCormack
  • Reading magic is a bad way to detect file types (especially since I am doubtful that is the correct magic). I think you should instead just open the file with gzip, and if it fails fall back to regular reading. – nneonneo Oct 07 '12 at 21:33
  • @nneonneo You have a point that one could pass the file to the gzip open method and catch an exception (or just doing the right thing silently) but you're wrong about everything else. Why don't you do a bit of research and find out what the gzip header format is before casting doubt on my process? – Alastair McCormack Oct 08 '12 at 09:31
  • gzip magic is only `1F 8B`; next value is a compression method that may change over time. :) – nneonneo Oct 08 '12 at 11:51
  • I suppose what you meant when you doubted `1F 8B 08` was correct was that you were actually only one-third in doubt? http://www.ietf.org/rfc/rfc1952.txt states that the 3rd byte is the compression method, where 08 means "deflate", i.e. zlib. I doubt there are many libs out there that would support gzip with non-deflate content :) – Alastair McCormack Oct 08 '12 at 12:09
  • @nneonneo, in your comment beginning "gzip magic" you stated the next value was "version number" before editing it hence why I pointed out the true meaning of byte 3. – Alastair McCormack Oct 08 '12 at 12:41
  • Using the gzip-library in Python is definitely a thing I could try out a little more. I am just wondering what are the implications of resetting the file marker? Will it just step back in the data already in the cached read or will I make file accesses to the file system? Using the magic is probably more robust than my intended method of just checking the filename ending and handling the errors that arise if it turned out not to be gzipped. :) – user1727089 Oct 08 '12 at 12:50
  • @user1727089 I've edited my answer to include nneonneo's suggestion to try to open the file in gzip and catch any exceptions during read. If you did re-seek in the file, Python would re-request the data from the OS which is very likely to have it cached in memory. – Alastair McCormack Oct 08 '12 at 13:37
  • I like your approach to the problem and I can definitely see myself making use of your suggestions in some other project. I'm forced to "shell out" in this project because a lot of the software we're running exposes no suitable interface other than the command-line one (of course gzip could be used like this anyway :). Thank you for your time! – user1727089 Oct 08 '12 at 14:43
  • The problem with this answer is that, unlike the original code, it reads the entire file contents into a string, which is undesirable for large inputs. Full-input buffering also breaks pipelining, slowing down processing on multicore systems. – user4815162342 Oct 08 '12 at 15:32
  • @user4815162342 good point. The code should be taken as an example. The OP may wish to pass the file object to the filter function and read the input with `for line in filehandle` which provides an elegant and efficient way to read large files, according to the Python documentation. – Alastair McCormack Oct 08 '12 at 17:05