
How do you gzip/gunzip files using Python at a speed comparable to the underlying libraries?

tl;dr - Use shutil.copyfileobj(f_in, f_out).

I'm decompressing *.gz files as part of a larger file-processing pipeline, and profiling to try to get Python to perform "close" to the built-in tools. With the amount of data I'm working with, this matters, and it seems like a generally important thing to understand.

Using the 'gunzip' bash command on a ~500MB file yields:

$time gunzip data.gz -k

real    0m24.805s

A naive Python implementation looks like:

with open('data','wb') as out:
    with gzip.open('data.gz','rb') as fin:
        s = fin.read()
        out.write(s)

real    2m11.468s

Don't read the whole file into memory (note: as the comments point out, this version still reads everything into memory; the only real change is dropping the intermediate variable):

with open('data','wb') as out:
    with gzip.open('data.gz','rb') as fin:
        out.write(fin.read())

real    1m35.285s

Check the local machine's buffer size:

>>> import io
>>> print(io.DEFAULT_BUFFER_SIZE)
8192

Use buffering:

with open('data','wb', 8192) as out:
    with gzip.open('data.gz','rb', 8192) as fin:
        out.write(fin.read())

real    1m19.965s

Use as much buffering as possible:

with open('data','wb',1024*1024*1024) as out:
    with gzip.open('data.gz','rb', 1024*1024*1024) as fin:
        out.write(fin.read())

real    0m50.427s

So clearly it is buffering/IO bound.

I have a moderately complex version that runs in 36 sec, but it involves a pre-allocated buffer and a tight inner loop. I expect there's a "better way."
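
For reference, a sketch of what such a pre-allocated-buffer loop might look like (the 1MB buffer size and the memoryview detail are assumptions; the actual 36-second version may differ):

import gzip

# Sketch of a pre-allocated-buffer loop; the buffer size is an assumption.
buf = bytearray(1024 * 1024)
view = memoryview(buf)  # lets us write a slice of buf without copying it
with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        n = fin.readinto(buf)      # returns the number of bytes read; 0 at EOF
        while n:
            fout.write(view[:n])   # write only the bytes just read
            n = fin.readinto(buf)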

The code above is reasonable and clear, albeit still slower than the bash script. But a solution that is extremely roundabout or complicated doesn't suit my needs; my main caveat is that I would like a "pythonic" answer.

Of course, there's always this solution:

subprocess.call(["gunzip","-k", "data.gz"])

real    0m24.332s

But for the purposes of this question: is there a faster way of processing files "pythonically"?

JHiant
  • sometimes python isn't always the answer, what's wrong with that? – gold_cy Apr 18 '17 at 21:01
  • Your examples indeed don't make any sense: All three of the Python examples 1) just copy and don't decompress at all 2) read the file into memory all at once 3) are in no way limited by io-buffering. Besides, `gunzip` and cpython's `gzip` module use the very same underlying library that is doing all the work – user2722968 Apr 18 '17 at 21:02
  • Thanks for catching those. Apologies for the necessary edits. I hit submit prematurely. 1) missed the gzip prefix from my working code. Added now. 2/3) by buffering it improved speed 2x. 4) yes, it does use the underlying library, so I'm trying to understand why it is so slow, especially given that the subprocess version is as fast as the underlying lib. 4b) foo.gz was a cut and paste from a sample, fixed now. 5.) 8219 was a typo. should have been 8192, which corresponds to the system's buffer size and added small speed increase. – JHiant Apr 18 '17 at 22:06
  • Update. If I put the max buffer size (1GB) into the read/write portions, then it gets to about half as fast (50 sec on my box) as the native implementations (25 sec). It's looking IO bound. – JHiant Apr 18 '17 at 22:33
  • This can't possibly be the code you're running. `as in:` is invalid Python syntax, because `in` is a keyword and so can't be used as a variable name. Please don't fake transcripts. – DSM Apr 18 '17 at 22:36
  • @DSM cleaned up. Thank you for catching that. – JHiant Apr 18 '17 at 22:53
  • @DmitryPolonskiy the question isn't whether python is the answer, the question is why python is performing a common operation so slowly given that it is using the same underlying library. IO buffering is part of the key, but the most elegant and pythonic answer remains to be found. – JHiant Apr 18 '17 at 23:03
  • The *"Don't read the whole file into memory"* version does read the whole file into memory. It does exactly the same as the *"naive python implementation"* (except creating that variable). – Stefan Pochmann Apr 18 '17 at 23:37
  • @StefanPochmann i'm not sure that's true. seems like it can take advantage of the fact that these are streams. just not optimally. – JHiant Apr 18 '17 at 23:43
  • @JHiant What do you mean these are streams? In Python 2 it's a `str` and in Python 3 it's a `bytes`. – Stefan Pochmann Apr 18 '17 at 23:46
  • It seems that a major reason for the observed speed differences between Python based `gzip` and the kernel based `gzip` or `gunzip` is the compression levels. Essentially, I see that Python has compression level set to `9` (https://docs.python.org/3/library/gzip.html) which is maximum, but kernel based version has compression level set as `6` by default (https://linux.die.net/man/1/gunzip). When I set the Python version as `6`, I get a minor difference (kernel stays faster). – Amin.A Nov 25 '20 at 10:52
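
To illustrate the compression-level point from the last comment above: the level only affects compressing, and in Python it is set when opening a gzip file for writing. A minimal sketch (reusing this question's file names; the specifics are for illustration only):

import gzip
import shutil

# Python's gzip defaults to compresslevel=9; the gzip CLI defaults to -6.
# Matching the levels makes a compression timing comparison fair.
with open('data', 'rb') as fin:
    with gzip.open('data.gz', 'wb', compresslevel=6) as fout:
        shutil.copyfileobj(fin, fout)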

1 Answer


I'm going to post my own answer. It turns out that you do need to use an intermediate buffer; Python doesn't handle this particularly well for you. You also need to play around with the size of that buffer, and the default buffer size turned out to be optimal: in my case both a very large buffer (1GB) and a smaller-than-default one (1KB) were slower.

Additionally, I tried the built-in io.BufferedReader and io.BufferedWriter classes with their readinto() support, and found that they were not necessary. (Not entirely true: the object returned by gzip.open() is already a buffered reader, which is what provides readinto() here.)

import gzip

buf = bytearray(8192)
with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        while True:
            n = fin.readinto(buf)  # number of bytes actually read; 0 at EOF
            if not n:
                break
            fout.write(buf[:n])    # slice so the final, possibly partial,
                                   # chunk isn't padded with stale bytes

real    0m27.961s

While I suspect this is a known Python pattern, it seems a lot of people were confused by it, so I will leave this here in hopes that it helps others.

@StefanPochmann got the correct answer. I hope he posts it so I can accept it. The solution is:

import gzip
import shutil
with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        shutil.copyfileobj(fin, fout)

real    0m26.126s
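
Worth noting: shutil.copyfileobj also accepts an optional chunk-size argument, so the buffer size can still be tuned if needed. A sketch (the 1MB value is an assumption to experiment with, not a measured optimum):

import gzip
import shutil

with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        # the third argument is the copy chunk size in bytes
        shutil.copyfileobj(fin, fout, 1024 * 1024)
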
JHiant
  • How about using `shutil.copyfileobj`, as suggested in the [`gzip` examples](https://docs.python.org/3.6/library/gzip.html#examples-of-usage)? (except of course uncompressing instead of compressing) – Stefan Pochmann Apr 18 '17 at 23:51
  • Thank you @StefanPochmann. Yes, that is the best solution. Came in at 26sec which is close enough to the native solutions. Cheers. – JHiant Apr 18 '17 at 23:54
  • I don't intend to post an answer, partially because I can't do your timing. Feel free to just accept your own once that's possible. – Stefan Pochmann Apr 19 '17 at 00:07