I'm writing a Python library that makes ZIP files in a streaming way. If the uncompressed or compressed data of a member of the zip is 4GiB or bigger, then it has to use a particular extension to the original ZIP format: zip64. The issue with always using zip64 is that it has less support, so I would like to use it only when needed. But whether a member is zip64 has to be specified in the zip before its compressed data, and so, if streaming, before the size of the compressed data is known.

In some cases, however, the size of the uncompressed data is known. So I would like to predict, based on this uncompressed size, the maximum size that zlib can output, and if this is 4GiB or bigger, use zip64 mode.

In other words, if the total length of the chunks in the code below is known, what is the maximum total length of bytes that get_compressed can yield? (I assume this maximum size would depend on level, memLevel and wbits.)

import zlib

chunks = (
    b'any',
    b'iterable',
    b'of',
    b'bytes',
    b'-' * 1000000,
)

def get_compressed(level=9, memLevel=9, wbits=-zlib.MAX_WBITS):
    compress_obj = zlib.compressobj(level=level, memLevel=memLevel, wbits=wbits)
    for chunk in chunks:
        if compressed := compress_obj.compress(chunk):
            yield compressed

    if compressed := compress_obj.flush():
        yield compressed

print('length', len(b''.join(get_compressed())))

This is complicated by the fact that the Python zlib module's behaviour is not consistent between Python versions.

I think that Java attempts a sort of "auto zip64 mode" without knowing the uncompressed data size, but libarchive has problems with it.

Michal Charemza
  • Looks like it does vary by memLevel a lot. But do you actually vary memLevel (and the other arguments)? – Kelly Bundy Jun 03 '23 at 12:04
  • @KellyBundy level is varied - especially to 0 for no compression. The others... realistically no. Although I am still curious as to how they affect it. – Michal Charemza Jun 03 '23 at 12:05

3 Answers


You could estimate it by compressing some random data. Compressed sizes for 1000 chunks of 1000 bytes each, with varying arguments:

level=0:  1000155 (+0.015%)
level=1:  1000155 (+0.015%)
level=2:  1000155 (+0.015%)
level=3:  1000155 (+0.015%)
level=4:  1000155 (+0.015%)
level=5:  1000155 (+0.015%)
level=6:  1000155 (+0.015%)
level=7:  1000155 (+0.015%)
level=8:  1000155 (+0.015%)
level=9:  1000155 (+0.015%)
memLevel=1:  1039350 (+3.935%)
memLevel=2:  1019600 (+1.960%)
memLevel=3:  1009780 (+0.978%)
memLevel=4:  1004885 (+0.488%)
memLevel=5:  1002445 (+0.245%)
memLevel=6:  1001225 (+0.122%)
memLevel=7:  1000615 (+0.061%)
memLevel=8:  1000310 (+0.031%)
memLevel=9:  1000155 (+0.015%)

And with 2000 chunks of 2000 bytes each:

level=0:  4000590 (+0.015%)
level=1:  4000610 (+0.015%)
level=2:  4000610 (+0.015%)
level=3:  4000610 (+0.015%)
level=4:  4000615 (+0.015%)
level=5:  4000615 (+0.015%)
level=6:  4000615 (+0.015%)
level=7:  4000615 (+0.015%)
level=8:  4000615 (+0.015%)
level=9:  4000615 (+0.015%)
memLevel=1:  4157400 (+3.935%)
memLevel=2:  4078390 (+1.960%)
memLevel=3:  4039120 (+0.978%)
memLevel=4:  4019540 (+0.488%)
memLevel=5:  4009770 (+0.244%)
memLevel=6:  4004885 (+0.122%)
memLevel=7:  4002445 (+0.061%)
memLevel=8:  4001225 (+0.031%)
memLevel=9:  4000615 (+0.015%)

So it looks like, if you only change level, it's about 0.015% overhead.

import zlib
import os

chunks = [
  os.urandom(1000)
  for _ in range(1000)
]

def get_compressed(level=9, memLevel=9, wbits=-zlib.MAX_WBITS):
    compress_obj = zlib.compressobj(level=level, memLevel=memLevel, wbits=wbits)
    for chunk in chunks:
        if compressed := compress_obj.compress(chunk):
            yield compressed

    if compressed := compress_obj.flush():
        yield compressed

insize = sum(map(len, chunks))
for level in range(10):
    compressed = get_compressed(level=level)
    outsize = len(b''.join(compressed))
    print(f'{level=}: ', outsize, f'({(outsize-insize)/insize:+.3%})')

for memLevel in range(1, 10):
    compressed = get_compressed(memLevel=memLevel)
    outsize = len(b''.join(compressed))
    print(f'{memLevel=}: ', outsize, f'({(outsize-insize)/insize:+.3%})')

Attempt This Online!
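
If the empirical figure were used in practice, one option (a heuristic sketch, not something from this answer) would be to pad the measured worst-case overhead with a generous safety margin and fall back to Zip64 whenever the padded estimate could reach the 4GiB limit:

ZIP32_LIMIT = 0xFFFFFFFF  # largest value a 32-bit size field can hold

def probably_needs_zip64(uncompressed_size, overhead=0.00015, margin=10.0):
    # overhead is the ~0.015% measured above; margin is an arbitrary
    # safety factor, so this is a guess, not a guarantee from zlib
    estimated_max = int(uncompressed_size * (1 + overhead * margin))
    return uncompressed_size > ZIP32_LIMIT or estimated_max > ZIP32_LIMIT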

Kelly Bundy
  • So this is helpful, and maybe I would have to do something like this - a percentage based on experimentation. But I think I do want to strive to find a "perfect" maximum, essentially to the byte. So never would a file be marked as zip64 if based on its uncompressed size it was never actually possible for its compressed size to be 4GiB or bigger. – Michal Charemza Jun 03 '23 at 12:43

Sure, you could find this out. But then you are relying on a detailed, undocumented behavior of a particular version of zlib. Deflate in zlib could be modified or rewritten, and then your code is broken.

Even if you have the exact bound for incompressible data, you could still end up with entries marked as needing Zip64 that don't need it, e.g. if the data is compressible, but the bound pushes the estimate over the limit.
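
For example (a hypothetical illustration, not part of this answer): 4294967294 zero bytes fit in the 32-bit size fields and deflate to a small fraction of that, yet any correct worst-case bound for that much input is itself over the 4294967295 limit (deflate cannot shrink incompressible input), so a bound-based check would mark the entry as Zip64 even though neither actual size requires it:

import zlib

n = 0xFFFFFFFF - 1  # uncompressed size just under the 32-bit limit
compress_obj = zlib.compressobj(level=1, wbits=-zlib.MAX_WBITS)
compressed_size = 0
remaining = n
while remaining:
    chunk_len = min(remaining, 1 << 24)  # feed 16MiB of zeros at a time
    compressed_size += len(compress_obj.compress(bytes(chunk_len)))
    remaining -= chunk_len
compressed_size += len(compress_obj.flush())
print(compressed_size)  # a small fraction of n, so Zip64 was never actually needed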

Furthermore, a streaming zipper, if it is truly streaming, should be able to accept a streaming input, in which case it has no idea what the uncompressed size is in the first place. So this wouldn't help.

The right way to handle this for a streaming zipper is to mark the local header as not needing Zip64. Upon discovering that it does need Zip64, use the appropriate data descriptor, and mark the entry in the central directory as needing Zip64. If an unzipper is using the central directory, as most do, then it has the right information. If the unzipper is streaming, then it has to try all of the possible data descriptors anyway, so it didn't matter what the local header claimed.
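
As a minimal sketch of that approach (my illustration, not code from this answer): stream the member's data first, then pick which data descriptor to write based on the sizes actually observed, and remember whether the central directory entry needs a Zip64 extra field:

import struct

ZIP32_MAX = 0xFFFFFFFF  # 0xFFFFFFFF is also the "see the Zip64 extra field" sentinel

def data_descriptor(crc_32, compressed_size, uncompressed_size):
    # Chosen only after the member has been streamed and compressed,
    # when the real sizes are finally known
    needs_zip64 = compressed_size >= ZIP32_MAX or uncompressed_size >= ZIP32_MAX
    if needs_zip64:
        # Zip64 data descriptor: 8-byte size fields; the central directory
        # record for this member must then also carry a Zip64 extra field
        return needs_zip64, struct.pack('<LLQQ', 0x08074b50, crc_32, compressed_size, uncompressed_size)
    # Plain data descriptor: 4-byte size fields
    return needs_zip64, struct.pack('<LLLL', 0x08074b50, crc_32, compressed_size, uncompressed_size)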

Mark Adler
  • So I want to do this - but I think that libarchive's streaming unzipper doesn't seem to fully support this? It errors at sizes bigger than 4GiB created by Java's streaming zipper that I _think_ works this way https://github.com/libarchive/libarchive/issues/1834 – Michal Charemza Jun 03 '23 at 17:02

There seem to be several bounds offered at https://github.com/madler/zlib/blob/04f42ceca40f73e2978b50e93806c2a18c1281fc/deflate.c#L696

One of these is a "tight" bound of about 0.03% overhead, calculated using:

uncompressed_size + (uncompressed_size >> 12) + (uncompressed_size >> 14) + (uncompressed_size >> 25) + 7

But it is only applicable if memLevel == 8 and abs(wbits) == 15: https://github.com/madler/zlib/issues/822

Using this, the largest uncompressed size that fits in a file without Zip64 is 4293656841 bytes: for that input, the formula gives a bound of exactly the Zip32 limit of 4294967295.
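
A quick check of that arithmetic (a trivial snippet, not part of the linked zlib code):

ZIP32_LIMIT = 0xFFFFFFFF  # 4294967295

def suspected_max(n):
    # The "tight" deflateBound() formula quoted above
    return n + (n >> 12) + (n >> 14) + (n >> 25) + 7

assert suspected_max(4293656841) == ZIP32_LIMIT      # exactly at the limit
assert suspected_max(4293656842) == ZIP32_LIMIT + 1  # one byte over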

To check this, we can compress 4293656841 bytes of random data:

import itertools
import os
import zlib

def gen_bytes(num, chunk_size=65536):
    while num:
        to_yield = min(chunk_size, num)
        num -= to_yield
        yield os.urandom(to_yield)

def get_suspected_max(uncompressed_size):
    return uncompressed_size + (uncompressed_size >> 12) + (uncompressed_size >> 14) + (uncompressed_size >> 25) + 7

def get_compressed(chunks, level=9):
    compress_obj = zlib.compressobj(level=level, memLevel=8, wbits=-zlib.MAX_WBITS)
    for chunk in chunks:
        if compressed := compress_obj.compress(chunk):
            yield compressed

    if compressed := compress_obj.flush():
        yield compressed

def get_sum(chunks):
    s = 0
    for c in chunks:
        s += len(c)
    return s

levels = [0, 1, 8, 9]
chunk_sizes = [10000000, 1000000, 65536, 10000, 1000]
for level, chunk_size in itertools.product(levels, chunk_sizes):
    num_bytes = 4293656841
    compressed_size = get_sum(get_compressed(gen_bytes(num_bytes, chunk_size=chunk_size), level=level))
    percentage_increase = (compressed_size - num_bytes) / num_bytes
    percentage_increase_str = f'{percentage_increase:+.3%}'
    suspected_max = get_suspected_max(num_bytes)
    print(f'level: {level}, num_bytes: {num_bytes}, chunk_size: {chunk_size}, compressed_size: {compressed_size}, increase: {percentage_increase_str}, diff from max: {suspected_max - compressed_size}')

which outputs for me:

level: 0, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4293985671, increase: +0.008%, diff from max: 981624
level: 0, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294000341, increase: +0.008%, diff from max: 966954
level: 0, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294148216, increase: +0.011%, diff from max: 819079
level: 0, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294214516, increase: +0.013%, diff from max: 752779
level: 0, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294307396, increase: +0.015%, diff from max: 659899
level: 1, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4294962221, increase: +0.030%, diff from max: 5074
level: 1, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294962221, increase: +0.030%, diff from max: 5074
level: 1, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294962226, increase: +0.030%, diff from max: 5069
level: 1, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294962221, increase: +0.030%, diff from max: 5074
level: 1, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294962216, increase: +0.030%, diff from max: 5079
level: 8, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 8, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 8, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 8, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294966576, increase: +0.031%, diff from max: 719
level: 8, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 10000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 1000000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 65536, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 10000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714
level: 9, num_bytes: 4293656841, chunk_size: 1000, compressed_size: 4294966581, increase: +0.031%, diff from max: 714

But yes, checking a few cases is certainly not a proof. This is also probably (as noted in the answer at https://stackoverflow.com/a/76396986/1319998) dependent on implementation details in zlib, which can change between versions.
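
If one does decide to rely on this bound, with those caveats in mind, the decision itself is then simple. A sketch, using get_suspected_max from the code above:

ZIP32_LIMIT = 0xFFFFFFFF

def use_zip64(uncompressed_size):
    # Mark the member as Zip64 if either the uncompressed size itself, or
    # the worst-case compressed size predicted by get_suspected_max, could
    # exceed what the 32-bit size fields can hold
    return (
        uncompressed_size > ZIP32_LIMIT
        or get_suspected_max(uncompressed_size) > ZIP32_LIMIT
    )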

Michal Charemza