There seem to be several bounds offered at https://github.com/madler/zlib/blob/04f42ceca40f73e2978b50e93806c2a18c1281fc/deflate.c#L696
One of these is a "tight" bound of ~0.03% overhead, calculated using:
uncompressed_size + (uncompressed_size >> 12) + (uncompressed_size >> 14) + (uncompressed_size >> 25) + 7
But it is only applicable if memLevel == 8 and abs(wbits) == 15, per https://github.com/madler/zlib/issues/822.
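For comparison, zlib's generic (looser) compressBound can be queried directly via ctypes. This is just a sketch: it assumes a system libz is loadable, and both the library name and the exact value returned are platform- and version-dependent.

import ctypes

libz = ctypes.CDLL('libz.so.1')  # Linux-specific name; e.g. 'libz.dylib' on macOS
libz.compressBound.argtypes = [ctypes.c_ulong]
libz.compressBound.restype = ctypes.c_ulong

# zlib's uLong is a C unsigned long, which is only 32 bits on some
# platforms, where a value this large would wrap
print(libz.compressBound(4293656841))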
Using this tight bound, the largest uncompressed size that can safely fit in a zip file without Zip64 is 4293656841 bytes, since this gives a bound of exactly the Zip32 limit of 4294967295.
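As a quick check of that arithmetic: the bound is monotonically non-decreasing in the input size, so the threshold can be found with a short binary search (a minimal sketch using the formula above):

def get_suspected_max(uncompressed_size):
    return uncompressed_size + (uncompressed_size >> 12) + (uncompressed_size >> 14) + (uncompressed_size >> 25) + 7

# Binary search for the largest size whose bound still fits in 32 bits
lo, hi = 0, 0xFFFFFFFF
while lo < hi:
    mid = (lo + hi + 1) // 2
    if get_suspected_max(mid) <= 0xFFFFFFFF:
        lo = mid
    else:
        hi = mid - 1

print(lo)                     # 4293656841
print(get_suspected_max(lo))  # 4294967295, exactly the Zip32 limit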
To check this, we can compress 4293656841 bytes of random data:
import itertools
import os
import zlib


def gen_bytes(num, chunk_size=65536):
    # Yields `num` bytes of incompressible (random) data in `chunk_size` pieces
    while num:
        to_yield = min(chunk_size, num)
        num -= to_yield
        yield os.urandom(to_yield)


def get_suspected_max(uncompressed_size):
    # The "tight" bound from zlib's deflate.c, valid for memLevel=8, abs(wbits)=15
    return uncompressed_size + (uncompressed_size >> 12) + (uncompressed_size >> 14) + (uncompressed_size >> 25) + 7


def get_compressed(chunks, level=9):
    # Raw deflate stream with the memLevel and wbits that the bound assumes
    compress_obj = zlib.compressobj(level=level, memLevel=8, wbits=-zlib.MAX_WBITS)
    for chunk in chunks:
        if compressed := compress_obj.compress(chunk):
            yield compressed
    if compressed := compress_obj.flush():
        yield compressed


def get_sum(chunks):
    return sum(len(chunk) for chunk in chunks)


levels = [0, 1, 8, 9]
chunk_sizes = [10000000, 1000000, 65536, 10000, 1000]

for level, chunk_size in itertools.product(levels, chunk_sizes):
    num_bytes = 4293656841
    compressed_size = get_sum(get_compressed(gen_bytes(num_bytes, chunk_size), level=level))
    percentage_increase = (compressed_size - num_bytes) / num_bytes
    percentage_increase_str = f'{percentage_increase:+.3%}'
    suspected_max = get_suspected_max(num_bytes)
    print(f'level: {level}, num_bytes: {num_bytes}, chunk_size: {chunk_size}, compressed_size: {compressed_size}, increase: {percentage_increase_str}, diff from max: {suspected_max - compressed_size}')
which outputs for me:
level: 0, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4293985671, increase: +0.008%, diff from max: 981624
level: 0, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294000341, increase: +0.008%, diff from max: 966954
level: 0, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294148216, increase: +0.011%, diff from max: 819079
level: 0, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294214516, increase: +0.013%, diff from max: 752779
level: 0, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294307396, increase: +0.015%, diff from max: 659899
level: 1, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4294962221, increase: +0.030%, diff from max: 5074
level: 1, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294962221, increase: +0.030%, diff from max: 5074
level: 1, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294962226, increase: +0.030%, diff from max: 5069
level: 1, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294962221, increase: +0.030%, diff from max: 5074
level: 1, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294962216, increase: +0.030%, diff from max: 5079
level: 8, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 8, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 8, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 8, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294966576, increase: +0.031%, diff from max: 719
level: 8, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
But yes - checking a few cases is certainly not a proof. This is also probably (as noted in the answer at https://stackoverflow.com/a/76396986/1319998) dependent on implementation details in zlib, which can change between versions.
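Another quick check, which is also not a proof, is to exercise the bound over many small incompressible inputs, where the constant "+ 7" term does the work; a minimal sketch under the same memLevel and wbits assumptions as above:

import os
import zlib

def get_suspected_max(uncompressed_size):
    return uncompressed_size + (uncompressed_size >> 12) + (uncompressed_size >> 14) + (uncompressed_size >> 25) + 7

# For inputs under 4096 bytes the bound reduces to num_bytes + 7, so this
# exercises the per-block overhead worst case on incompressible data
for num_bytes in range(4096):
    compress_obj = zlib.compressobj(level=9, memLevel=8, wbits=-zlib.MAX_WBITS)
    compressed = compress_obj.compress(os.urandom(num_bytes)) + compress_obj.flush()
    assert len(compressed) <= get_suspected_max(num_bytes)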