
I have an app that seeks within a .gz file-like object.

Python's gzip.GzipFile supports this, but very inefficiently: when asked to seek backward, the GzipFile object rewinds to the beginning of the stream (seek(0)) and then reads and decompresses everything up to the desired offset.

Needless to say this absolutely kills performance when seeking around a large tar.gz file (tens of gigabytes).
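The rewind is easy to observe with a small in-memory file (harmless at this size, but the same full re-decompression happens on every backward seek in a multi-gigabyte archive):

```python
import gzip
import io

# Build a 1 MiB gzip member in memory.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode='wb') as g:
    g.write(b'x' * (1 << 20))
raw.seek(0)

f = gzip.GzipFile(fileobj=raw, mode='rb')
f.seek(1 << 19)   # forward: decompresses ~512 KiB to get here
f.seek(100)       # backward: internally rewinds to 0 and re-decompresses 100 bytes
assert f.read(4) == b'xxxx'
```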

So I'm looking to implement checkpointing: store the stream state every now and then, and when asked to seek back, go back only to the nearest stored checkpoint, instead of rewinding all the way to the beginning.

My question is around the gzip / zlib implementation: What does the "current decompressor state" consist of? Where is it stored? How big is it?

And how do I copy that state out of an open GzipFile object, and then assign it back for the "backward jump" seek?

Note: I have no control over the input .gz files. The solution must work with plain GzipFile objects in read-only ('rb') mode.


EDIT: Looking at CPython's source, this is the relevant code flow & data structures. Ordered from top-level (Python) down to raw C:

  1. gzip.GzipFile._buffer.raw

  2. gzip._GzipReader

  3. gzip._GzipReader.seek() == DecompressReader.seek() <=== NEED TO CHANGE THIS

  4. ZlibDecompressor state + its deepcopy <=== NEED TO COPY / RESTORE THIS

  5. z_stream struct

  6. internal_state struct
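
For what it's worth, the deepcopy mentioned in step 4 is already reachable from Python: zlib.Decompress.copy() snapshots the full inflate state (via inflateCopy() in zlibmodule.c). A quick check on a raw zlib stream (no gzip framing) confirms a snapshot resumes identically:

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 10000
comp = zlib.compress(data)

d = zlib.decompressobj()
head = d.decompress(comp, 1000)   # produce the first 1000 uncompressed bytes
snap = d.copy()                   # inflateCopy() of the live z_stream
tail = d.unconsumed_tail          # compressed input not yet consumed

rest_orig = d.decompress(tail) + d.flush()
rest_snap = snap.decompress(tail) + snap.flush()
assert head + rest_orig == data
assert rest_snap == rest_orig     # the snapshot resumed identically
```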


EDIT2: Also found this teaser in zlib:

An access point can be created at the start of any deflate block, by saving the starting file offset and bit of that block, and the 32K bytes of uncompressed data that precede that block. Also the uncompressed offset of that block is saved to provide a reference for locating a desired starting point in the uncompressed stream.

Another way to build an index would be to use inflateCopy(). That would not be constrained to have access points at block boundaries, but requires more memory per access point, and also cannot be saved to file due to the use of pointers in the state.

(what they call "access points" is what I call "checkpoints"; same thing)

This pretty much answers all my questions but I still need to find a way to translate this zran.c example to work with the gzip/zlib scaffolding in CPython.
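
As a proof of concept before porting zran.c properly, the inflateCopy() route can be sketched in pure Python with zlib.decompressobj().copy(). This is a toy for a bare zlib stream, not the gzip framing that _GzipReader handles; all names are illustrative, and it stores the remaining compressed input per checkpoint for simplicity (a real index would store a file offset instead):

```python
import zlib

SPACING = 64 * 1024  # take a checkpoint every 64 KiB of output (illustrative)

def build_index(compressed):
    """Map uncompressed offset -> (decompressor snapshot, unconsumed input)."""
    d = zlib.decompressobj()
    index = {0: (d.copy(), compressed)}
    produced, tail = 0, compressed
    while tail and not d.eof:
        chunk = d.decompress(tail, SPACING)
        produced += len(chunk)
        tail = d.unconsumed_tail
        index[produced] = (d.copy(), tail)
    return index

def read_at(index, offset, n):
    """Decompress n bytes at `offset`, starting from the nearest checkpoint."""
    start = max(o for o in index if o <= offset)
    snap, tail = index[start]
    d = snap.copy()                    # don't consume the stored snapshot
    skip = offset - start
    out = d.decompress(tail, skip + n)
    if len(out) < skip + n:            # stream ended: drain buffered output
        out += d.flush()
    return out[skip:skip + n]
```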

user124114
  • I don't think this is exposed in the API. You'll probably have to modify zlib to add it. – Barmar Mar 22 '23 at 20:08
  • @Barmar I think it is, check out https://github.com/python/cpython/blob/90d85a9b4136aa1feb02f88aab614a3c29f20ed3/Modules/zlibmodule.c#L1147-L1191. But how to put all the pieces together? – user124114 Mar 22 '23 at 20:54
  • You've linked to source code, but "exists in the source code" and "exposed in the API" are very different thresholds. – user2357112 Mar 22 '23 at 21:15
  • @user2357112 Actually I've linked to both the API and the parts of CPython that consume it. Namely `inflateCopy()` in decompressor's deepcopy. – user124114 Mar 22 '23 at 21:19
  • Personally, I wouldn't use rsyncable gzip when you need to seek() at all, but would instead use a purpose-built file format (intended to be both compressed and seekable as well as deduplicating/chunking) like caibx -- see http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html for a design overview and https://github.com/folbricht/desync for a high-quality implementation in Go. (systemd's author wrote and maintains the original C implementation; when I tried to use it in production, which was admittedly some time ago, the rough edges were unacceptable). – Charles Duffy Mar 22 '23 at 21:33
  • @CharlesDuffy that would be great but see the **Note** in the OP. – user124114 Mar 22 '23 at 21:36
  • Some good finds in the docs, by the way. I wasn't aware that zlib saved uncompressed offsets -- a lot of patterns used to create initramfs files &c _don't_ keep offsets but just concatenate completely independent streams (which gunzip and similar tools generally decompress as a single stream entirely without complaint -- so it's worth making sure the files you're reading really do use the described facility). – Charles Duffy Mar 22 '23 at 21:41
  • Looks like somebody has already written a library to solve this problem: https://github.com/pauldmccarthy/indexed_gzip – Nick ODell Mar 22 '23 at 21:43
  • @NickODell that looks very promising! Could you post that as an answer? I'll test replacing `gzip.GzipFile` with `indexed_gzip.IndexedGzipFile` and if that works, accept your answer. – user124114 Mar 22 '23 at 22:05
  • How much do you need to jump around? – Kelly Bundy Mar 23 '23 at 12:46

1 Answer


You could try a library called indexed_gzip, which builds on top of zlib's zran.c utility. Essentially, this library keeps a series of checkpoints throughout the file, and when a request for a specific byte offset arrives, it starts decompressing from the nearest preceding checkpoint. (indexed_gzip calls this an "index seek point.")

Example usage from the documentation:

import indexed_gzip as igzip

# You can create an IndexedGzipFile instance by specifying a file name.
myfile = igzip.IndexedGzipFile('big_file.gz')

# Or by passing an open file handle. In this use, the file handle
# must be opened in read-only binary mode:
myfile = igzip.IndexedGzipFile(fileobj=fileobj, auto_build=True, spacing=1024**2)

# Write support is currently non-existent.

The auto_build mode (True by default) enables incremental index building: each call to seek(offset) expands the checkpoint index to cover the requested offset if it isn't covered already.

You'll probably want to tune the spacing parameter, which controls the number of uncompressed bytes between consecutive checkpoints. This is a time-memory tradeoff: more checkpoints mean less work on each seek, but more memory spent storing them. It defaults to 1 MB of uncompressed data per checkpoint.

For faster startup, you can write the index out to disk (the index is much smaller than the underlying compressed file) and load the index the next time your program runs. See Index import/export for more about how to use this feature.
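A hedged sketch of that workflow, using the export_index/import_index methods as the indexed_gzip documentation describes them (file names and the seek offset are illustrative):

```python
import indexed_gzip as igzip

# First run: build the full index up front and save it next to the file.
with igzip.IndexedGzipFile('big_file.gz') as f:
    f.build_full_index()
    f.export_index('big_file.gzidx')

# Later runs: load the prebuilt index instead of re-scanning the stream.
with igzip.IndexedGzipFile('big_file.gz') as f:
    f.import_index('big_file.gzidx')
    f.seek(10 * 1024**3)   # now a cheap jump, not a full re-decompress
    chunk = f.read(4096)
```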

Nick ODell
  • Fantastic. By the way, there's a similar Python library (efficient seek) available for bz2 too: https://github.com/mxmlnkn/indexed_bzip2/tree/master/python/indexed_bzip2 – user124114 Mar 23 '23 at 12:02
  • And for XZ: https://github.com/Rogdham/python-xz – user124114 Mar 23 '23 at 12:29
  • @user124114 And there is also a parallelized rewrite of indexed_gzip here: https://github.com/mxmlnkn/indexed_bzip2/tree/master/python/pragzip – mxmlnkn Apr 27 '23 at 15:12