31

I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets, start and end, and process that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, the offset pairs that are toward the end of the file will involve wastefully reading through the whole file as it's slowly decompressed.

So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file, or read an arbitrary chunk of it, without the full cost of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O throughput overhead?

Matthias Braun
  • 32,039
  • 22
  • 142
  • 171
Cera
  • 1,879
  • 2
  • 20
  • 29
  • Relevant libraries if you are processing large gzipped files with Hadoop or Spark: [GZinga](https://tech.ebayinc.com/engineering/gzinga-seekable-and-splittable-gzip/), which generates seekable gzipped files, and [SplittableGzip](https://github.com/nielsbasjes/splittablegzip), which works with any old gzipped file and "wastes" CPU time to effectively make it seekable by your cluster. Very different approaches with different trade-offs (GZinga goes for performance, SplittableGzip goes for universal compatibility) but both are interesting. – Nick Chammas Nov 05 '19 at 01:25
  • As an experiment, I wrote [a tool](https://github.com/llandsmeer/gzip-random-seek) for random access in DEFLATE streams. Surprisingly, it was possible to decompress from halfway through certain simple files, but it also showed that it is indeed very much impossible in the general case :( – Lennart Feb 29 '20 at 15:32

4 Answers

33

Yes, you can access a gzip file randomly by reading the entire thing sequentially once and building an index. See examples/zran.c in the zlib distribution.
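For illustration only, here is a minimal sketch of how the zran code is typically wired up, assuming you have copied `build_index()`, `extract()` and `free_index()` out of `examples/zran.c` into your program (they are declared `local`, i.e. `static`, in that file). Function names and signatures vary between zlib versions, so check your copy; the file name, span and offset below are made up:

```c
#include <stdio.h>
#include <sys/types.h>          /* off_t */

/* declarations matching the functions lifted from examples/zran.c */
struct access;                  /* the access-point index built by zran */
int build_index(FILE *in, off_t span, struct access **built);
int extract(FILE *in, struct access *index, off_t offset,
            unsigned char *buf, int len);
void free_index(struct access *index);

#define SPAN 1048576L           /* desired distance between access points */
#define LEN  16384              /* how much uncompressed data to pull out */

int main(void) {
    FILE *in = fopen("big.gz", "rb");       /* hypothetical input file */
    if (in == NULL) return 1;

    /* one full sequential pass builds the index (returns #access points) */
    struct access *index = NULL;
    int points = build_index(in, SPAN, &index);
    if (points < 0) { fprintf(stderr, "index failed: %d\n", points); return 1; }

    /* afterwards, any uncompressed offset is reachable without
       re-reading everything that precedes it */
    unsigned char buf[LEN];
    int got = extract(in, index, (off_t)123456789, buf, LEN);
    if (got < 0) { fprintf(stderr, "extract failed: %d\n", got); return 1; }
    fwrite(buf, 1, got, stdout);

    free_index(index);
    fclose(in);
    return 0;
}
```

Note that the zran example keeps the index only in memory; if the sequential pass is expensive and several worker processes need it, you would serialize the index to disk or build it once per worker.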

If you are in control of creating the gzip file, then you can optimize the file for this purpose by building in random-access entry points and constructing the index while compressing.

You can also create a gzip file with markers by using `Z_SYNC_FLUSH` followed by `Z_FULL_FLUSH` in zlib's `deflate()` to insert two markers and make the next block independent of the previous data. This will reduce the compression, but not by much if you don't do it too often; e.g. once every megabyte should have very little impact. Then you can search for a nine-byte marker (with a much less probable false positive than bzip2's six-byte marker): `00 00 ff ff 00 00 00 ff ff`.
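A rough sketch of what that looks like with the zlib API (this code is not from the answer; the helper names and the choice to write straight to a `FILE *` are my own, and `outsize` should comfortably exceed a few bytes, e.g. a few KB, since the first flush also emits any output still pending):

```c
#include <stdio.h>
#include <zlib.h>

/* Run one flush on an active deflate stream and write whatever comes out,
 * looping in case the output buffer fills up with pending data. */
static int flush_once(z_stream *strm, int flush, unsigned char *out,
                      unsigned outsize, FILE *dest)
{
    strm->avail_in = 0;
    do {
        strm->next_out = out;
        strm->avail_out = outsize;
        int ret = deflate(strm, flush);
        if (ret != Z_OK && ret != Z_BUF_ERROR)
            return ret;
        size_t have = outsize - strm->avail_out;
        if (fwrite(out, 1, have, dest) != have)
            return Z_ERRNO;
    } while (strm->avail_out == 0);
    return Z_OK;
}

/* Insert the two markers: an empty stored block (00 00 ff ff), then another
 * one combined with a history reset, so the next deflate block does not
 * depend on earlier data. Call every so often, e.g. once per megabyte. */
static int insert_marker(z_stream *strm, unsigned char *out, unsigned outsize,
                         FILE *dest)
{
    int ret = flush_once(strm, Z_SYNC_FLUSH, out, outsize, dest);
    if (ret == Z_OK)
        ret = flush_once(strm, Z_FULL_FLUSH, out, outsize, dest);
    return ret;
}
```

The stream is assumed to have been set up for gzip output, e.g. with `deflateInit2()` and `windowBits` of 15+16.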

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • 2
    Is there a tutorial on the use of zran; i.e. how do I use it to index a gzipped file and subsequently access a line number (or character number?) of my choice? I think this is the silver bullet I've been looking for. – tommy.carstensen Apr 08 '14 at 22:19
  • 1
    No, there is nothing beyond the comments in the source file. You need to read the comments for `build_index()` and `extract()`, and you can see an example of their use in `main()`. – Mark Adler Apr 08 '14 at 22:38
  • Also the following blog post may be of interest here: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/ – PhiS Dec 13 '14 at 18:34
  • Using this approach we cannot read a gz file of size `1 GB` with a memory limit of `128 MB`. The gz format says the compression is limited to a `32K`-byte distance - how can this feature be used to read the file sequentially, in chunks, rather than entirely in memory? – user 923227 Mar 19 '18 at 23:29
  • 2
    @SumitKumarGhosh I am not understanding your question. Perhaps you should ask a new question instead of putting it in a comment. – Mark Adler Mar 19 '18 at 23:52
17

You can't do that with gzip, but you can do it with bzip2, which is block-based instead of stream-based - this is how the Hadoop DFS splits and parallelizes the reading of huge files with different mappers in its MapReduce algorithm. Perhaps it would make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way to chunk up the files.

I found the patches that implement this in Hadoop here: https://issues.apache.org/jira/browse/HADOOP-4012

Here's another post on the topic: BZip2 file read in Hadoop

Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.

Andrew Mao
  • 35,740
  • 23
  • 143
  • 224
  • You first need to read the bzip2 file sequentially to find the blocks. Then you can access them individually. The same can be done with the gzip format. – Mark Adler Jan 09 '13 at 19:30
  • I don't think what you mentioned is the best way to random access a compressed file, see this article: http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html and also this issue tracker in Hadoop: https://issues.apache.org/jira/browse/HADOOP-4012 – Andrew Mao Jan 09 '13 at 19:54
  • As I say in my answer, you can prepare a gzip file for optimized random access. Some applications for random access are in control of the creation of the gzip file, in which case you would prepare the gzip file for that purpose and build an index at the same time. Some applications are not in control of the creation of the gzip file, in which case you need to decompress the thing once to build an index. – Mark Adler Jan 09 '13 at 20:43
  • The same is true of bzip2. pbzip2, which provides parallel compression and decompression of bzip2 files, can only provide parallel decompression if pbzip2 itself has made the bzip2 file. In that case, the bzip2 file consists of individual bzip2 streams concatenated together. That can be done with gzip as well, as suggested by @Celada in the answers here. – Mark Adler Jan 09 '13 at 20:46
  • 2
    @MarkAdler bzip2 is much better for this task because you don't have to read from the beginning, as you do with gzip. You can dive into the middle, and look for a block boundary. From http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html: *The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty.* – nealmcb Feb 10 '16 at 20:55
  • FYI, there is also standalone code for bzip2 random access, given one efficient preprocessing/index generation step at [james_taylor / seek-bzip2 — Bitbucket](https://bitbucket.org/james_taylor/seek-bzip2) and an explanation of how to use it at [james_taylor / bx-python / wiki / IO / SeekingInBzip2Files — Bitbucket](https://bitbucket.org/james_taylor/bx-python/wiki/IO/SeekingInBzip2Files) See also http://stackoverflow.com/questions/12660028/reading-memory-mapped-bzip2-compressed-file Adding block-finding code to make the bzip-table preprocessing step optional would make it great.... – nealmcb Feb 10 '16 at 21:05
12

gzip does in fact expect to be able to stream the file from the beginning. You cannot start in the middle.

What you can do is break the file up into blocks that are each compressed with gzip and then concatenated together. You can choose any size you like for each piece, for example 10MB or 100MB. You then decompress starting at the beginning of the piece that contains the byte offset you require. Thanks to a little-known feature of gzip (decompressing a file that is the concatenation of several smaller gzipped files produces the same output as decompressing each of the smaller files and concatenating the results), the piecewise-compressed large file will also work with standard gzip -d/gunzip if you download the whole thing.

The tricky part: you have to maintain an index containing the byte offset of the start of each compressed piece in the large file.
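As an illustration of the approach (not code from the answer), here is a sketch using zlib that writes each piece as an independent gzip member appended to one output file and records its offsets; the 10MB piece size comes from the example above, and the two-column text index format is an arbitrary choice:

```c
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PIECE (10L * 1024 * 1024)   /* uncompressed bytes per piece (10MB) */
#define CHUNK 65536                 /* I/O buffer size */

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s INPUT OUTPUT.gz INDEX\n", argv[0]);
        return 1;
    }
    FILE *in  = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "wb");
    FILE *idx = fopen(argv[3], "w");
    if (!in || !out || !idx) { perror("fopen"); return 1; }

    unsigned char inbuf[CHUNK], outbuf[CHUNK];
    int eof = 0;
    while (!eof) {
        /* record where this piece starts: compressed offset in the output
           file, and the uncompressed offset it corresponds to */
        fprintf(idx, "%lld %lld\n",
                (long long)ftello(out), (long long)ftello(in));

        /* each piece is a complete, independent gzip member
           (windowBits 15+16 selects the gzip wrapper) */
        z_stream strm = {0};
        if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;

        long remaining = PIECE;
        int flush;
        do {
            size_t want = remaining < CHUNK ? (size_t)remaining : CHUNK;
            size_t got = fread(inbuf, 1, want, in);
            remaining -= (long)got;
            if (got < want)
                eof = 1;                      /* input exhausted (or error) */
            flush = (eof || remaining == 0) ? Z_FINISH : Z_NO_FLUSH;
            strm.next_in = inbuf;
            strm.avail_in = (unsigned)got;
            do {                              /* standard zlib deflate loop */
                strm.next_out = outbuf;
                strm.avail_out = CHUNK;
                deflate(&strm, flush);
                fwrite(outbuf, 1, CHUNK - strm.avail_out, out);
            } while (strm.avail_out == 0);
        } while (flush != Z_FINISH);
        deflateEnd(&strm);

        /* stop cleanly if the input ended exactly on a piece boundary */
        int c = getc(in);
        if (c == EOF) eof = 1; else ungetc(c, in);
    }
    fclose(idx); fclose(out); fclose(in);
    return 0;
}
```

Standard gunzip still decompresses the whole output file in one go, because each piece is a complete gzip member and concatenated members behave as described above. A worker that wants the chunk starting at uncompressed offset X looks up the last index line whose uncompressed offset is at most X, seeks to the matching compressed offset (e.g. with tail -c), pipes from there into gzip -dc, and discards the few leading bytes before X.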

Celada
  • 21,627
  • 4
  • 64
  • 78
  • 4
    You can start in the middle of a gzip stream, so long as you have decompressed it once from the beginning and constructed entry points. Or if you have made the entry points when you compressed it. – Mark Adler Jan 09 '13 at 19:23
  • That's really interesting, @MarkAdler, thanks for the tip. You *do* have to store 32KiB worth of data together with each access point in your index, but I guess that's probably OK if the distance between access points is huge. – Celada Jan 09 '13 at 19:34
  • Correct. Though if you are in control of creating the gzip file, then you can put in historyless entry points that don't require the 32K. pigz (a parallel gzip compressor) does this with the -i option. – Mark Adler Jan 09 '13 at 22:15
  • @Celada care to have a stab at [How can I decompress and print the last few lines of a compressed text file?](http://unix.stackexchange.com/q/292556)? Could you apply your trick to get the last few lines of a large file? – terdon Jun 28 '16 at 10:50
5

FWIW: I've developed a command-line tool built upon zlib's zran.c which creates indexes for gzip files, allowing very quick random access inside them: https://github.com/circulosmeos/gztool

It can even create an index for a still-growing gzip file (for example a log created by rsyslog directly in gzip format), which in practice reduces the index creation time to zero. See the `-S` (Supervise) option.

circulosmeos
  • 424
  • 1
  • 6
  • 19
  • FYI, there is an updated zran you get here: https://github.com/madler/zlib/blob/develop/examples/zran.h , https://github.com/madler/zlib/blob/develop/examples/zran.c . – Mark Adler Mar 03 '23 at 23:51
  • Thanks, Mark! I'll try to see in the future what new features I can squeeze out of it for `gztool`. Also, big thanks for the `zran` code – circulosmeos Mar 10 '23 at 10:48