I have an app that seeks within a .gz file-like object.
Python's gzip.GzipFile supports this, but very inefficiently: when the GzipFile object is asked to seek back, it rewinds to the beginning of the stream (seek(0)) and then reads and decompresses everything up to the desired offset.
Needless to say, this absolutely kills performance when seeking around a large tar.gz file (tens of gigabytes).
So I'm looking to implement checkpointing: store the stream state at regular intervals, and when asked to seek back, go only to the nearest stored checkpoint at or before the target offset and decompress forward from there, instead of rewinding all the way to the beginning.
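To make the idea concrete, here is a minimal sketch of just the lookup side, assuming checkpoints are keyed by uncompressed offset. The class name CheckpointIndex and its methods are made up for illustration; the saved state is treated as an opaque object.

```python
import bisect


class CheckpointIndex:
    """Hypothetical index of checkpoints, sorted by uncompressed offset."""

    def __init__(self):
        self._offsets = []  # uncompressed offsets, kept sorted
        self._states = []   # opaque saved states, parallel to _offsets

    def add(self, offset, state):
        """Store a checkpoint, replacing any existing one at the same offset."""
        i = bisect.bisect_left(self._offsets, offset)
        if i < len(self._offsets) and self._offsets[i] == offset:
            self._states[i] = state
        else:
            self._offsets.insert(i, offset)
            self._states.insert(i, state)

    def nearest_before(self, target):
        """Return (offset, state) of the last checkpoint at or before target.

        Returns None when no checkpoint qualifies, in which case the caller
        has to fall back to rewinding to the start of the stream.
        """
        i = bisect.bisect_right(self._offsets, target) - 1
        if i < 0:
            return None
        return self._offsets[i], self._states[i]
```

A backward seek would call nearest_before(target), restore that state, and then decompress and discard (target - offset) bytes.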
My question is around the gzip / zlib implementation: What does the "current decompressor state" consist of? Where is it stored? How big is it?
And how do I copy that state out of an open GzipFile object, and then assign it back for the "backward jump" seek?
Note I have no control over the input .gz files. The solution must be strictly for GzipFile in read-only rb mode.
EDIT: Looking at CPython's source, this is the relevant code flow & data structures. Ordered from top-level (Python) down to raw C:
gzip._GzipReader.seek() == DecompressReader.seek() <=== NEED TO CHANGE THIS
ZlibDecompressor state + its deepcopy <=== NEED TO COPY / RESTORE THIS
EDIT2: Also found this teaser in zlib:
An access point can be created at the start of any deflate block, by saving the starting file offset and bit of that block, and the 32K bytes of uncompressed data that precede that block. Also the uncompressed offset of that block is saved to provide a reference for locating a desired starting point in the uncompressed stream.
Another way to build an index would be to use inflateCopy(). That would not be constrained to have access points at block boundaries, but requires more memory per access point, and also cannot be saved to file due to the use of pointers in the state.
(they call "access points" what I call "check points"; same thing)
This pretty much answers all my questions, but I still need to find a way to translate this zran.c example to work with the gzip/zlib scaffolding in CPython.
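For reference, here is a rough sketch of the inflateCopy() route from pure Python. It does not touch GzipFile's internals. Instead it drives zlib.decompressobj() directly and relies on the decompression object's copy() method, which the zlib docs describe as a way to save decompressor state mid-stream to speed up later seeks. The class name CheckpointedGzipReader, the read_at() method and the span / chunk parameters are my own inventions. It assumes a single-member gzip stream that starts at offset 0 of a seekable file, so it is a starting point rather than a drop-in GzipFile replacement.

```python
import bisect
import zlib


class CheckpointedGzipReader:
    """Sketch: random-access reads over a single-member gzip stream."""

    def __init__(self, raw, span=1 << 20, chunk=64 * 1024):
        self._raw = raw      # seekable binary file positioned at the gzip header
        self._span = span    # minimum uncompressed distance between checkpoints
        self._chunk = chunk  # compressed bytes fed to zlib per call
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        fresh = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # Parallel lists: uncompressed offsets (sorted) and
        # (compressed_offset, saved decompressor) pairs.
        self._uoffs = [0]
        self._points = [(0, fresh)]

    def read_at(self, offset, size):
        """Return up to `size` uncompressed bytes starting at `offset`."""
        # Resume from the last checkpoint at or before the requested offset.
        i = bisect.bisect_right(self._uoffs, offset) - 1
        upos = self._uoffs[i]
        cpos, saved = self._points[i]
        d = saved.copy()  # work on a copy so the stored state stays reusable
        self._raw.seek(cpos)
        end = offset + size
        out = bytearray()
        while upos < end and not d.eof:
            comp = self._raw.read(self._chunk)
            if not comp:
                break  # truncated input: return what we have
            # Without max_length, zlib consumes the whole chunk, so the
            # compressed position after this call is simply cpos + len(comp).
            data = d.decompress(comp)
            lo = max(offset - upos, 0)
            hi = min(end - upos, len(data))
            if lo < hi:
                out += data[lo:hi]
            upos += len(data)
            cpos += len(comp)
            # Record a new checkpoint once we are far enough past the last one.
            if not d.eof and upos - self._uoffs[-1] >= self._span:
                self._uoffs.append(upos)
                self._points.append((cpos, d.copy()))
        return bytes(out)
```

As the zran.c comment warns, each saved copy holds the full inflate state (including the 32K window), so span trades memory for seek latency, and the checkpoints only live in memory. Wrapping read_at() in an io.RawIOBase subclass with seek() / tell() would be needed before handing it to tarfile.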