
From Wikipedia, on ZPAQ compression:

ZPAQ has 5 compression levels from fast to best. At all but the best level, it uses the statistics of the order-1 prediction table used for deduplication to test whether the input appears random. If so, it is stored without compression as a speed optimization.

I've been working with Python's Data Compression and Archiving modules, and I wonder whether any of those implementations (zlib, bz2, lzma) do the same. Do any of them simply store the data as-is when it looks almost random? I'm not a coding expert and can't really follow the source code.


Related: How to efficiently predict if data is compressible

Md. Sabbir Ahmed
Paul Uszak

1 Answer


Some incomplete / best-guess remarks:

LZMA2 seems to do that, although for a different reason: to improve the compression ratio, not the compression time.

Wikipedia's LZMA article indicates this:

  • LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters.
  • The XZ LZMA2 encoder processes the input in chunks (of up to 2 MB uncompressed size or 64 KB compressed size, whichever is lower), handing each chunk to the LZMA encoder, and then deciding whether to output an LZMA2 LZMA chunk including the encoded data, or to output an LZMA2 uncompressed chunk, depending on which is shorter (LZMA, like any other compressor, will necessarily expand rather than compress some kinds of data).

The second quote also shows that no compression-speed gain should be expected, since the approach is more or less: do both and pick the shorter result.

(The article seems to focus on the xz implementation of LZMA2; this probably carries over to whatever Python uses internally, but no guarantees.)
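A quick way to check whether this carries over to Python is to feed both filters the same random input and compare the overhead. This is only a sketch: the 1 MiB input size and preset 6 are my own choices, and it measures output sizes, it does not inspect the chunk types directly. If LZMA2 really falls back to uncompressed chunks, its overhead should be a handful of bytes, while LZMA1 (which, as far as I know, has no such fallback in its format) should expand the data noticeably more.

```python
import lzma
import os

# 1 MiB of (practically) incompressible bytes; the size is an arbitrary choice.
data = os.urandom(1 << 20)

# LZMA1 in the legacy .lzma container (FORMAT_ALONE).
lzma1_size = len(lzma.compress(data, format=lzma.FORMAT_ALONE))

# Raw LZMA2 stream: no container header, so the overhead is easy to read off.
lzma2_filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
lzma2_size = len(
    lzma.compress(data, format=lzma.FORMAT_RAW, filters=lzma2_filters)
)

print("input size:    ", len(data))
print("LZMA1 overhead:", lzma1_size - len(data), "bytes")
print("LZMA2 overhead:", lzma2_size - len(data), "bytes")
```

The exact numbers depend on the liblzma version that Python is linked against, so treat them as an indication rather than a guarantee.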

The above, together with Python's docs:

Compression filters:
    FILTER_LZMA1 (for use with FORMAT_ALONE)
    FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

makes me think you've got everything you need and just have to use the right filter.

So check your motivation again (compression time or compression ratio) and, if you don't want to trust this blindly, try the LZMA2 filter on custom-prepared mixed data.
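Such a mixed-data test could look roughly like the following. It is a minimal sketch with made-up test data (a repeated sentence around a block of random bytes, sizes picked arbitrarily): it compresses the mix with an explicit LZMA2 filter chain, verifies the round trip, and compares the output size with the size of the random part alone.

```python
import lzma
import os

# Hypothetical test data: highly compressible text around a random block.
text = b"The quick brown fox jumps over the lazy dog.\n" * 20000
noise = os.urandom(512 * 1024)
mixed = text + noise + text

# Explicit LZMA2 filter chain inside the .xz container.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
packed = lzma.compress(mixed, format=lzma.FORMAT_XZ, filters=filters)
restored = lzma.decompress(packed)

print("mixed input:  ", len(mixed), "bytes")
print("random part:  ", len(noise), "bytes")
print("compressed:   ", len(packed), "bytes")
print("round trip ok:", restored == mixed)
```

If the compressed size ends up only slightly above the size of the random part, the text was compressed well and the random block cost little more than storing it as-is, which is what you would hope for.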

Intuition: I don't expect the more classic zlib/bz2 formats to exploit incompressible data (but that's a pure guess).
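Since this is a guess, it is worth measuring. One thing that speaks against it for zlib: the DEFLATE format (RFC 1951) does define a "stored" block type that holds raw bytes, so zlib at least has a way to avoid expanding random data; I'm not aware of an equivalent in the bzip2 format. The sketch below (input size and compression levels are my own choices) compresses the same random buffer with all three modules and prints how much each output exceeds the input.

```python
import bz2
import lzma
import os
import zlib

# 1 MiB of random bytes as a stand-in for incompressible input.
data = os.urandom(1 << 20)

sizes = {
    "zlib": len(zlib.compress(data, 9)),
    "bz2": len(bz2.compress(data, 9)),
    "lzma (xz)": len(lzma.compress(data)),
}

for name, size in sizes.items():
    overhead = size - len(data)
    print(f"{name:10s} {size:8d} bytes  overhead {overhead:+d} "
          f"({overhead / len(data):.3%})")
```

A small overhead only shows that the format avoids blowing up the data; it says nothing about whether time was saved, because the compressor may still have done the full work before deciding to store the bytes.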

sascha