Tools exist to provide random access to gzip and bzip2 archives.

I'm looking for any similar solution for 7zip.

(The goal is to utilize the sometimes gigantic Wikipedia dump files offline without keeping decompressed copies around)

hippietrail
  • Slightly pedantic, I admit, but is the real goal to avoid re-compressing the archives, rather than avoiding de-compressing them? (I'd expect generating the index to require decompression of the archive, albeit into memory instead of onto disk.) If you don't mind a one-off re-compression phase, then you could re-compress into a 7z with the SOLID option disabled (or set to a small value), which gives you archives you can do random access into without massive waits. (IMO, that default solid option is used in more places than it should be. :() – Leo Davidson Dec 17 '10 at 08:34
  • I don't mind decompressing them as a one time cost to create the index. But I don't want to recompress them because I want limited machines such as netbooks to be able to use unchanged archive files as they are published. Recompressing is a lot slower and more resource intensive, plus the recompressed archives would no longer have MD5 checksums matching the published ones. Getting the publisher to generate the archives in a different format might take some negotiation but I'll reserve that as a last resort, in which case concatenating many smaller 7zip archives will probably also work OK. – hippietrail Dec 17 '10 at 11:26

2 Answers

I thought it would be better to summarize GZIP, BZIP2 and LZMA internals to make a few things clear:

  1. GZIP is actually a format that uses the Deflate algorithm. Due to its static Huffman codes (the Deflate documents also mention dynamic Huffman codes, but in practice those are static too), Deflate has to be encoded block-wise (the sliding window is another term that comes up here). zran.c seems to find those block boundaries and tries to decode at most two consecutive blocks, which can hold the few KiB of uncompressed data needed to fill the entire 32 KiB window. So random access is quite possible even without an index table (a checkpointing sketch follows this list).

  2. BZIP2 is actually a BWT-class compression algorithm, and given the nature of BWT it is no wonder that it is block-wise. Its blocks are limited to 900 KiB each. Block boundaries are also well defined, which makes recovery easy (they carry large, distinctive markers). So you can even use multiple threads at once to decompress all of the data. In other words, random access is quite possible even without any table (it is effectively already supported by the format); a boundary-scanning sketch also follows this list.

  3. LZMA supports dictionaries of up to 1 GiB and is not encoded block-wise. It uses a range coder rather than a Huffman coder to encode probabilities. Even with a 64 MiB window (a very common value), the nature of the range coder means you cannot simply start decoding at a given random point until the entire window has been filled. LZMA's state machine can be bothersome too. So an implementation is quite hard, or even impossible.
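
To make the GZIP point concrete, here is a minimal Python sketch (my illustration, not zran.c itself): one sequential pass over a gzip file that saves a copy of the zlib decompressor state every few megabytes of output, so later reads can resume from the nearest saved state instead of from the beginning. zran.c follows the same access-point idea but stores just the 32 KiB window plus a bit offset per point; the spacing constant and helper names below are arbitrary.

```python
import zlib

CHUNK = 64 * 1024            # compressed bytes read per step
SPACING = 4 * 1024 * 1024    # uncompressed bytes between access points (arbitrary)

def build_index(path):
    """One pass over a (single-member) .gz file, saving decompressor state copies."""
    d = zlib.decompressobj(wbits=31)          # wbits=31: expect gzip header/trailer
    index = [(0, 0, d.copy())]                # (uncomp_offset, comp_offset, state)
    out_pos = comp_pos = 0
    next_mark = SPACING
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            out_pos += len(d.decompress(chunk))
            comp_pos += len(chunk)
            if out_pos >= next_mark:
                index.append((out_pos, comp_pos, d.copy()))
                next_mark = out_pos + SPACING
    return index

def read_at(path, index, target, length):
    """Decompress `length` bytes starting at uncompressed offset `target`."""
    # Resume from the last access point at or before the target offset.
    out_pos, comp_pos, state = max((e for e in index if e[0] <= target),
                                   key=lambda e: e[0])
    d = state.copy()                          # keep the stored copy reusable
    buf = bytearray()
    with open(path, "rb") as f:
        f.seek(comp_pos)
        while len(buf) < (target - out_pos) + length:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            buf += d.decompress(chunk)
    return bytes(buf[target - out_pos : target - out_pos + length])
```

Note that these in-memory state copies cannot simply be written to disk; a persistent index would store the raw 32 KiB window bytes instead, which is what zran.c does.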
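
Similarly, a minimal sketch (illustrative only) of locating the BZIP2 block markers: every compressed block begins with the 48-bit magic number 0x314159265359 (the "pi" constant also mentioned in the comments under the second answer), so candidate block starts can be found by scanning the file at every bit offset. Any hit should still be verified by decoding from it, since the pattern can, with low probability, appear inside ordinary block data; the pure-Python loop below is slow and only meant to show the idea.

```python
BLOCK_MAGIC = 0x314159265359   # 48-bit marker at the start of each bzip2 block
MAGIC_BITS = 48

def candidate_block_bits(path):
    """Yield possible block start positions as absolute bit offsets."""
    mask = (1 << MAGIC_BITS) - 1
    window = 0
    bitpos = -1
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            for byte in chunk:
                for bit in range(7, -1, -1):        # bzip2 packs bits MSB-first
                    window = ((window << 1) | ((byte >> bit) & 1)) & mask
                    bitpos += 1
                    if bitpos >= MAGIC_BITS - 1 and window == BLOCK_MAGIC:
                        yield bitpos - (MAGIC_BITS - 1)
```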

Maybe the LZMA2 or PPM methods can be used for such purposes (7-zip supports them as well within the 7-zip format). PPM flushes its model when its statistics are full, and LZMA2 intentionally flushes some state at intervals to enable multi-threaded decompression. A random-access implementation for them could be possible.

Osman Turan

My lzopfs project allows random access to lzop, gzip, bzip2 and xz files. XZ files are LZMA-encoded, so hopefully they are an acceptable substitute for 7-zip for your purposes. Note that for realistic random access you will need to create your xz archive with a blocked encoder, such as pixz or the multithreaded mode of xz-utils 5.1.x alpha.
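
As a usage illustration only (this is not part of lzopfs, and it assumes an xz-utils recent enough to understand --block-size; the 16 MiB block size and function names are arbitrary), one way to produce a block-aligned .xz from Python is to shell out to the xz CLI and then inspect the block layout that xz records in its index:

```python
import subprocess

def make_blocked_xz(path, block_size="16MiB"):
    """Recompress `path` into `path + '.xz'`, split into independently decodable blocks."""
    subprocess.run(["xz", "--keep", "--force", f"--block-size={block_size}", path],
                   check=True)
    return path + ".xz"

def show_block_layout(xz_path):
    """Print the stream/block details stored in the xz index (doubly verbose list)."""
    subprocess.run(["xz", "--list", "--verbose", "--verbose", xz_path], check=True)
```

pixz likewise writes block-aligned output (and adds an index of tar members), which is what makes the cheap seeking described above possible.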

vasi
  • Vasi, thank you! Can you describe how lzopfs works with each format? Does it need a full unpack to create an index? Will it work with a full Wikipedia dump? – osgx May 04 '14 at 18:33
  • Sure. For xz, as long as you do a blocked encode, seekability is cheap and built into the file format. For other formats, the first time lzopfs sees a file it scans it to produce an index file. Scanning is very fast for lzop, but for bzip2 and gzip it's slow, equivalent to decompressing the whole file. Once the index is present, at least, seeking is cheap. It will work for a Wikipedia dump if you can find the offsets you need to seek to. Do you have some way to get that info? – vasi Jun 05 '14 at 10:29
  • vasi, last time I checked (http://stackoverflow.com/q/7882337/196561), the full wikidump (with all history) was packed as a single-block 7z using the LZMA method. To find offsets in the wiki dump I'll need to do a one-time scan to build an index ("article name" to offset), but it is also possible to do a binary search (there is at least one article name in every few megabytes, and all names are sorted). PS: for bzip2 you can scan for the 6-byte "pi" constant to find the beginning of every compressed block (but the constant is not aligned to a byte boundary). – osgx Jun 05 '14 at 11:30
  • And there are faster methods than lzop, according to Matt Mahoney's [Large Text Compression Benchmark](http://mattmahoney.net/dc/text.html): 'libzling', 'etincelle a3', 'zlite', 'slug 1.27' and 'thor 0.95 e4' all compress better and faster than lzop. I don't know about their seeking capability. – osgx Jun 05 '14 at 11:37
  • @osgx, I do search for the bzip2 constants. But I still do decompression to make sure it's not spurious, since the constant could appear (with low probability) in real block data. – vasi Jun 05 '14 at 19:45
  • And I intend to add blocked-[LZ4](https://code.google.com/p/lz4/) support next, it's just as scannable as lzop. – vasi Jun 05 '14 at 19:46
  • https://github.com/openzim/libzim was created for offline Wikipedia (Kiwix): high compression, random access. – milahu Apr 20 '23 at 11:48