
As the title says, I'm looking for a way to decompress a gzip file so that I can stop decompressing on one host, upload the current position in the file together with the 32 KiB of preceding decompressed data needed to start inflating a new block, and then resume inflation on another host. Ideally I'd do this with existing Java packages, but I can call into C if required.

I've already looked at https://github.com/madler/zlib and https://docs.oracle.com/javase/8/docs/api/java/util/zip/Inflater.html . I find the zlib library confusing and difficult to understand, but I have a vague idea that it can be used for what I need. Any pointers in the right direction would be appreciated. So far I can parse a gzip file's header and get to the first deflate block, but the Java zlib bindings don't let you inflate one block at a time and don't return any sort of checkpoint along the way.

A. Eglin

1 Answer


You are correct that the Java Inflater (spelled wrong) class does not provide an interface to the zlib capabilities required for your application. You would need to link to the zlib library in C directly.

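zran.c, in the examples directory of the zlib distribution, provides a working example of what you're looking for. You use the Z_BLOCK flush value with inflate() to stop inflation at the next deflate block boundary. On return, the low three bits of strm->data_type tell you how many bits of the last consumed byte of input are still unused, and so belong to the next block. (zlib also adds 128 to data_type when inflate() returned right after an end-of-block code or after the header, and 64 while it is decoding the last block of the stream.) You will need to track the offset of the next full byte to consume, and maintain your own buffer of the last 32K of uncompressed data.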

Now you can deliver that offset into the file, the number of bits from the preceding byte to use for the next block, and the 32K of uncompressed data. On the receiving side, start a new raw inflation with inflateInit2(), passing a negative windowBits such as -15 to request raw deflate. Use inflatePrime() to insert those initial 0 to 7 bits from the byte that precedes the first full byte of the next block, and inflateSetDictionary() to supply the 32K (or less) of preceding uncompressed data. Then call inflate() and you're off to the races.

Since you are decoding a gzip stream, and since you need to use raw inflate to start decoding in the middle, you will also need to keep track of two more things. Those are the running total number of uncompressed bytes and the running CRC-32 of the uncompressed data. They are needed to perform the integrity check using the trailer at the end of the gzip stream, which contains the final values of those. zlib also provides the crc32() function for that calculation.
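To make the bookkeeping concrete, here is a minimal sketch of carrying the two running values across hosts and checking them against the trailer. The names gzip_totals, totals_add() and trailer_matches() are invented for this example. The bitwise crc32_update() is only a dependency-free stand-in; in a real program you would call zlib's crc32(), which follows the same convention of starting from 0 and passing the previous result back in.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for zlib's crc32(): bitwise CRC-32, reflected polynomial
   0xEDB88320. Start with crc = 0 and feed each result back in. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Hypothetical record of the two running values that travel with a checkpoint. */
struct gzip_totals {
    uint32_t crc;      /* CRC-32 of all uncompressed data so far */
    uint64_t length;   /* count of uncompressed bytes so far */
};

/* Account for one more chunk of uncompressed output. */
void totals_add(struct gzip_totals *t, const unsigned char *buf, size_t len)
{
    t->crc = crc32_update(t->crc, buf, len);
    t->length += len;
}

/* Read a 32-bit little-endian value. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* The gzip trailer (RFC 1952) is the CRC-32 followed by ISIZE, the
   uncompressed length modulo 2^32, both little-endian. Returns 1 on match. */
int trailer_matches(const struct gzip_totals *t, const unsigned char trailer[8])
{
    return le32(trailer) == t->crc && le32(trailer + 4) == (uint32_t)t->length;
}
```

Each host would call totals_add() on every chunk it produces and hand the struct on with the rest of the checkpoint; whichever host reaches the end of the deflate data calls trailer_matches() on the next eight input bytes.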

For the start of the gzip stream, you can either decode the gzip header yourself, which is straightforward referencing RFC 1952, or you can use inflateInit2() requesting gzip encoding and inflate() with Z_BLOCK to decode the header.
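If you take the first route, skipping the header by hand might look something like the sketch below, which follows the field layout in RFC 1952. The function name gzip_header_len() is made up for this example, and the sketch only skips the optional header CRC (FHCRC) rather than verifying it.

```c
#include <stddef.h>

/* Return the offset of the first byte of deflate data in the gzip member
   starting at p[0..n), or 0 if the header is invalid or truncated. */
size_t gzip_header_len(const unsigned char *p, size_t n)
{
    /* Fixed part: ID1 ID2 CM FLG MTIME(4) XFL OS. CM must be 8 (deflate)
       and the three reserved FLG bits must be zero. */
    if (n < 10 || p[0] != 0x1f || p[1] != 0x8b || p[2] != 8 || (p[3] & 0xe0))
        return 0;
    unsigned flg = p[3];
    size_t i = 10;

    if (flg & 0x04) {                       /* FEXTRA: 2-byte length + data */
        if (n - i < 2)
            return 0;
        size_t xlen = p[i] | ((size_t)p[i + 1] << 8);
        i += 2;
        if (n - i < xlen)
            return 0;
        i += xlen;
    }
    if (flg & 0x08) {                       /* FNAME: zero-terminated string */
        while (i < n && p[i] != 0)
            i++;
        if (i == n)
            return 0;
        i++;
    }
    if (flg & 0x10) {                       /* FCOMMENT: zero-terminated string */
        while (i < n && p[i] != 0)
            i++;
        if (i == n)
            return 0;
        i++;
    }
    if (flg & 0x02) {                       /* FHCRC: 2 bytes, not verified here */
        if (n - i < 2)
            return 0;
        i += 2;
    }
    return i;
}
```

The returned offset is where a raw inflate (negative windowBits) would begin reading, with no priming bits and no dictionary needed for the first block.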

Note that a concatenation of valid gzip streams is a valid gzip stream. If you get to the end of a gzip trailer, and there is still more input from the gzip file, then start over with the decoding of a gzip header.

Mark Adler