How do you gzip/gunzip files using python at a speed comparable to the underlying libraries?
tl;dr - Use shutil.copyfileobj(f_in, f_out).
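In context, that looks roughly like this (a minimal sketch; the explicit 1 MiB copy length is optional and just one reasonable choice):
import gzip
import shutil

# Stream decompressed bytes from the gzip reader straight to the output file.
with gzip.open('data.gz', 'rb') as f_in, open('data', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out, 1024 * 1024)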
I'm decompressing *.gz files as part of a larger file-processing pipeline, and profiling to try to get Python to perform "close" to the native command-line tools. With the amount of data I'm working with, this matters, and it seems like a generally useful thing to understand.
Using the 'gunzip' command on a ~500MB file as follows yields:
$ time gunzip data.gz -k
real 0m24.805s
A naive Python implementation looks like this:
import gzip

with open('data', 'wb') as out:
    with gzip.open('data.gz', 'rb') as fin:
        s = fin.read()
        out.write(s)
real 2m11.468s
Skip the intermediate variable (fin.read() still loads the whole file into memory):
with open('data', 'wb') as out:
    with gzip.open('data.gz', 'rb') as fin:
        out.write(fin.read())
real 1m35.285s
Check the local machine's default buffer size:
>>> import io
>>> print(io.DEFAULT_BUFFER_SIZE)
8192
Use buffering:
with open('data', 'wb', 8192) as out:
    # note: gzip.open's third positional argument is compresslevel, not a buffer
    # size, and it is ignored when reading
    with gzip.open('data.gz', 'rb', 8192) as fin:
        out.write(fin.read())
real 1m19.965s
Use as much buffering as possible:
with open('data', 'wb', 1024*1024*1024) as out:
    with gzip.open('data.gz', 'rb', 1024*1024*1024) as fin:
        out.write(fin.read())
real 0m50.427s
So clearly it is buffering/IO bound.
I have a moderately complex version that runs in 36 s, but it involves a pre-allocated buffer and a tight inner loop (a simplified sketch follows). I expect there's a "better way."
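Roughly, that approach looks like the following (a simplified sketch, not the exact code the 36 s figure came from; the 1 MiB buffer size is just an illustrative choice):
import gzip

BUF_SIZE = 1024 * 1024            # 1 MiB; illustrative, not tuned
buf = bytearray(BUF_SIZE)         # pre-allocated buffer, reused on every iteration
view = memoryview(buf)

with gzip.open('data.gz', 'rb') as fin, open('data', 'wb') as out:
    while True:
        n = fin.readinto(buf)     # fill the buffer with decompressed bytes
        if not n:
            break
        out.write(view[:n])       # write only the bytes actually read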
The code above is reasonable and clear, albeit still slower than the shell command. A solution that is extremely roundabout or complicated doesn't suit my needs; my main caveat is that I would like a "pythonic" answer.
Of course, there's always this solution:
import subprocess
subprocess.call(["gunzip", "-k", "data.gz"])
real 0m24.332s
But for the purposes of this question: is there a faster way of processing these files "pythonically"?