Python's zlib doesn't work on CommonCrawl file

Question

I was trying to decompress a file using Python's zlib and it doesn't seem to work. The file is about 100 MB, comes from Common Crawl, and I downloaded it as wet.gz. When I decompress it in the terminal with gunzip, everything works fine. Here are the first few lines of the output:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for August 2022
publisher: Common Crawl



WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://100bravert.main.jp/public_html/wiki/index.php?cmd=backup&action=nowdiff&page=Game_log%2F%EF%BC%A7%EF%BC%AD%E6%9F%98&age=53
WARC-Date: 2022-08-07T15:32:56Z
WARC-Record-ID: <urn:uuid:8dd329bf-6717-4d0c-ae05-93445c59fd50>
WARC-Refers-To: <urn:uuid:1e2e972b-4273-468a-953f-28b0e45fb117>
WARC-Block-Digest: sha1:GTEJAN2GXLWBXDRNUEI3LLEHDIPJDPTU
WARC-Identified-Content-Language: jpn
Content-Type: text/plain
Content-Length: 12482

Game_log/ＧＭ柘 のバックアップの現在との差分(No.53) - PukiWiki
Game_log/ＧＭ柘 のバックアップの現在との差分(No.53)
[ トップ ] [ 新規 | 一覧 | 単語検索 | 最終更新 | ヘルプ ]
バックアップ一覧

However, when I try Python's gzip or zlib library with the following code:

import gzip
import zlib

# using gzip
with gzip.open('wet.gz', 'rb') as fh:
    data = fh.read()

# using zlib (MAX_WBITS|16 tells zlib to expect the gzip format)
o = zlib.decompressobj(zlib.MAX_WBITS|16)
with open("wet.gz", "rb") as f:
    result = [o.decompress(f.read()), o.flush()]

Both of them return this:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for August 2022
publisher: Common Crawl

So apparently both can decompress the first few paragraphs just fine, but everything below them is lost. Is this a bug in Python's zlib/gzip library?

Edit for future readers: I've integrated the accepted answer into my Python package, in case you'd rather not implement it yourself:

pip install k1lib

from k1lib.imports import *
lines = cat("wet.gz", text=False, chunks=True) | unzip(text=True)
for line in lines:
    print(line)

This reads the file in binary mode chunk by chunk, decompresses the chunks incrementally, splits the result into lines, and converts them to strings.

Mark Adler · Accepted Answer · 2023-07-01T01:01:54.987

Your wet.gz consists of 31,849 concatenated gzip members. Per the gzip standard, a concatenation of valid gzip streams is itself a valid gzip stream.

Python's decompressobj() does not automatically continue reading and decompressing the gzip members after the first. Yes, I would consider this a bug, since it does not comply with the gzip standard, though it is a common failure to comply.
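A small self-contained check of this behavior, using two tiny in-memory gzip members as a stand-in for wet.gz (the sample byte strings are made up for illustration):

```python
import gzip
import zlib

# Two independent gzip members, concatenated into one stream.
first = gzip.compress(b"first member\n")
second = gzip.compress(b"second member\n")
stream = first + second

# A single decompressobj stops at the end of the first member.
o = zlib.decompressobj(zlib.MAX_WBITS | 16)
only_first = o.decompress(stream)

print(only_first)                # output of the first member only
print(o.eof)                     # True once the member's trailer has been read
print(o.unused_data == second)   # the second member is left untouched
```

The bytes of the second member are not lost; they are handed back in o.unused_data and have to be fed to a new decompressor.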

The workaround is simple. Put the Python decompression in a loop, continuing to decompress until the input is consumed. o.unused_data holds the input left over after the end of the member just decompressed, for use in decompressing the next member.

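import zlib
f = open("wet.gz", "rb")
# MAX_WBITS + 16 tells zlib to expect the gzip format
o = zlib.decompressobj(zlib.MAX_WBITS + 16)
data = left = b''
while True:
    got = f.read(32768)
    # feed the leftover from the previous member plus the new input
    data += o.decompress(left + got)
    left = b''
    if o.eof:
        # end of a member: keep what follows it, start a new decompressor
        left = o.unused_data
        o = zlib.decompressobj(zlib.MAX_WBITS + 16)
    # stop once the file is exhausted and nothing is left to decompress
    if len(got) == 0 and len(left) == 0:
        break
f.close()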

(That also avoids loading the entire input into memory. For illustration, it accumulates the entire output in memory, but if possible that data should be processed as it arrives instead.)
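As a rough sketch of processing the data as it arrives, the same loop can be written as a generator that yields decompressed chunks instead of accumulating them. The name iter_gzip_chunks, the chunk_size parameter, and the in-memory sample data are assumptions made for illustration; they are not part of zlib or of the code above.

```python
import gzip
import io
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tells zlib to expect the gzip format


def iter_gzip_chunks(fileobj, chunk_size=32768):
    """Yield decompressed bytes from a (possibly multi-member) gzip stream."""
    o = zlib.decompressobj(GZIP_WBITS)
    while True:
        buf = fileobj.read(chunk_size)
        if not buf:
            break  # input exhausted
        while buf:
            out = o.decompress(buf)
            if out:
                yield out
            if o.eof:
                # Member finished: carry the leftover input into a new decompressor.
                buf = o.unused_data
                o = zlib.decompressobj(GZIP_WBITS)
            else:
                buf = b""  # the current member consumed all of buf


# Hypothetical stand-in for wet.gz: three small members, concatenated.
sample = b"".join(gzip.compress(b"record %d\n" % i) for i in range(3))
output = b"".join(iter_gzip_chunks(io.BytesIO(sample), chunk_size=7))
print(output)
```

With a real file, the caller would open it in binary mode and handle each yielded chunk in turn (for example, splitting it into WARC records) rather than joining the chunks as the demo does.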

Reading the file with Python's gzip module (gzip.open() followed by read()) works for me on wet.gz, decompressing the whole thing. Perhaps you have an older version of Python.
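For comparison, a minimal sketch of reading a multi-member file through the gzip module, line by line in text mode. The helper name count_lines and the temporary sample file are assumptions standing in for wet.gz, and this does not explain why the same approach stopped early in the question.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count text lines in a gzip file, streaming instead of loading it whole."""
    n = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for _ in fh:
            n += 1
    return n


# Hypothetical stand-in for wet.gz: two gzip members written back to back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")
    with open(path, "wb") as f:
        f.write(gzip.compress("WARC/1.0\nfirst record\n".encode("utf-8")))
        f.write(gzip.compress("WARC/1.0\nsecond record\n".encode("utf-8")))
    total = count_lines(path)

print(total)  # lines from both members are counted
```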

For future readers, my Python version is 3.9.15 – 157 239n, Jun 12 '23 at 00:06
Well that's odd. I have Python 3.9.6. – Mark Adler, Jun 12 '23 at 00:24
