
This code uses zlib to encode some data, but with level=0 so it's not actually compressed:

```python
import zlib

print('zlib.ZLIB_VERSION', zlib.ZLIB_VERSION)

total = 0
print('Total 1', total)
compress_obj = zlib.compressobj(level=0, memLevel=9, wbits=-zlib.MAX_WBITS)
total += len(compress_obj.compress(b'-' * 1000000))
print('Total 2', total)
total += len(compress_obj.flush())
print('Total 3', total)
```

Python 3.9.12 outputs

```
zlib.ZLIB_VERSION 1.2.12
Total 1 0
Total 2 983068
Total 3 1000080
```

but Python 3.10.6 (and Python 3.11.0) outputs

```
zlib.ZLIB_VERSION 1.2.13
Total 1 0
Total 2 1000080
Total 3 1000085
```

so there is both a different final size and a different size along the way.
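For what it's worth, the observed sizes are consistent with deflate's stored-block format: at level=0, each stored block carries at most 65535 bytes of payload and costs a 5-byte header (1 byte of BFINAL/BTYPE flags plus 2-byte LEN and 2-byte NLEN, per RFC 1951). A sketch of the arithmetic (the interpretation of the extra 5 bytes as one additional empty stored block from the final flush is my assumption):

```python
import math

# 1,000,000 bytes of input split into stored blocks of at most 65535 bytes
payload = 1000000
blocks = math.ceil(payload / 65535)

print(blocks)                      # 16 stored blocks
print(payload + 5 * blocks)        # 1000080 - the 3.9.12 final size
# The 5 extra bytes in 3.10.6/3.11.0 would match one additional
# (empty) stored block emitted at the end of the stream.
print(payload + 5 * (blocks + 1))  # 1000085
```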

Why? And how can I get them to be identical? (I'm writing a library where I would prefer identical behaviour between Python versions)

Michal Charemza
  • to confirm, are either of your python versions from an older install? There was an update to the zlib library that [was patched into python around Apr 2022](https://github.com/python/cpython/issues/91350), so if one of your versions is from before the patch it could be a cause. – Shorn May 31 '23 at 08:25
  • @Shorn I've now put in the specific versions of Python: 3.9.12 and 3.10.6 – Michal Charemza May 31 '23 at 08:35
  • @Shorn Also now put the zlib version each uses. I see there is a difference... – Michal Charemza May 31 '23 at 14:06
  • zlib change log, for reference (not that any of the listed changes is obviously the culprit): https://www.zlib.net/ChangeLog.txt – Jeremy Friesner May 31 '23 at 14:29
  • And why exactly would you "prefer identical behaviour between Python versions"? If the data decompresses correctly, then _there is no problem_. – Mark Adler May 31 '23 at 15:41
  • If you need a compression algorithm implemented with deterministic results, that requires factoring that property into tool selection. zlib/gzip has never guaranteed it; thus, what you've been relying on is undefined behavior. – Charles Duffy May 31 '23 at 16:11
  • In general, though: Store and compare the hashes of the plaintext, not the compressed stream. There's an infinite number of possible compressed streams that decompress to the same plaintext, so that plaintext is what you should focus on integrity of. – Charles Duffy May 31 '23 at 16:13
  • @MarkAdler It's part of https://github.com/uktrade/stream-zip/pull/43 to automatically choose zip64 if it's needed for a particular input where the uncompressed size is known. I'm hoping there is some strict threshold based on uncompressed size where you can be sure only zip32 is needed. I _thought_ I had such a threshold, but it seemed different for different Python versions. – Michal Charemza May 31 '23 at 16:43
  • @CharlesDuffy I'm more trying to determine the maximum compressed size given uncompressed size in order to know if compressed data can be stored in a zip file without zip64 extensions. This is only one part of that, but I thought interesting/answerable enough for a single question. – Michal Charemza May 31 '23 at 17:17
  • It is far better to ask your _actual question_ on stackoverflow, than some sub-sub-question. This is a perfect example, since you are going about your actual question in exactly the wrong way. You want to find out what the largest expansion is for uncompressible data, but instead you found the _smallest_ expansion for uncompressible data. Backing up even further, you don't need to know if you might need zip64 extensions until such time as you already know. – Mark Adler May 31 '23 at 19:44
  • @MarkAdler Point taken about asking actual question - ish! I would more say I have many actual questions, and this was one of them. Even if it doesn't directly help the library I'm writing, it helps my curiosity and learning (and hopefully, other people's). I'm not too anti going the wrong way even... as long as it's still a reasonable SO question, that's fine? I learn around the topic, and SO gets a reasonable question and answer? – Michal Charemza May 31 '23 at 21:40
  • I do think this question is a good one (reproducibility is often of practical importance!), _but_ there's a slippery slope implied by the comment above I'd like to address in case it comes up somewhere else: Stack Overflow's scope is limited to "practical, answerable questions based on actual problems that you face"; so curiosity isn't enough to establish topicality on its own. See related discussion in [What is the rationale for closing "why" questions in language design?](https://meta.stackexchange.com/questions/170394) – Charles Duffy Jun 01 '23 at 12:56
  • @CharlesDuffy Ah I guess I thought curiosity _was_ enough. Have to admit - slightly saddened by that – Michal Charemza Jun 01 '23 at 13:02
  • So https://stackoverflow.com/questions/76395799/maximum-size-of-compressed-data-using-pythons-zlib is my actual question – Michal Charemza Jun 03 '23 at 11:21

1 Answer


zlib 1.2.12 and 1.2.13 behave identically in this regard. The Python library must be making different deflate() calls with different amounts of data, and possibly introducing a flush in the later version. You can look in the Python source code to find out.

You should be able to force identical output if you feed smaller amounts of data to .compress() each time, e.g. less than 64K-1, and use .flush() after each. The output will be larger, but should be identical across versions.
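A minimal sketch of that suggestion (the exact chunk size, and the use of `Z_SYNC_FLUSH` to flush between chunks without ending the stream, are assumptions on my part; the argumentless `.flush()` finishes the stream, so it is only called once at the end):

```python
import zlib

CHUNK = 65534  # less than 64K-1 bytes per .compress() call

data = b'-' * 1000000
compress_obj = zlib.compressobj(level=0, memLevel=9, wbits=-zlib.MAX_WBITS)

parts = []
for i in range(0, len(data), CHUNK):
    parts.append(compress_obj.compress(data[i:i + CHUNK]))
    # Force the pending stored block out at each chunk boundary
    # without ending the stream
    parts.append(compress_obj.flush(zlib.Z_SYNC_FLUSH))
parts.append(compress_obj.flush())  # finish the stream

compressed = b''.join(parts)
print('Total', len(compressed))
```

Because each `.compress()` call is followed by a flush, the block boundaries are pinned by the caller rather than by the library's internal buffering, which is what should make the output reproducible across versions. The sync flushes do add a few bytes per chunk of overhead.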

A quick look turned up this commit, which is likely the culprit.

Mark Adler