I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is a 100MB download from Common Crawl, which I saved as wet.gz. When I unzip it in the terminal with gunzip, everything works fine. Here are the first few lines of the output:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371
Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for August 2022
publisher: Common Crawl
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://100bravert.main.jp/public_html/wiki/index.php?cmd=backup&action=nowdiff&page=Game_log%2F%EF%BC%A7%EF%BC%AD%E6%9F%98&age=53
WARC-Date: 2022-08-07T15:32:56Z
WARC-Record-ID: <urn:uuid:8dd329bf-6717-4d0c-ae05-93445c59fd50>
WARC-Refers-To: <urn:uuid:1e2e972b-4273-468a-953f-28b0e45fb117>
WARC-Block-Digest: sha1:GTEJAN2GXLWBXDRNUEI3LLEHDIPJDPTU
WARC-Identified-Content-Language: jpn
Content-Type: text/plain
Content-Length: 12482
Game_log/GM柘 のバックアップの現在との差分(No.53) - PukiWiki
Game_log/GM柘 のバックアップの現在との差分(No.53)
[ トップ ] [ 新規 | 一覧 | 単語検索 | 最終更新 | ヘルプ ]
バックアップ一覧
However, when I try to use Python's gzip or zlib library with these code examples:
import gzip
import zlib

# using gzip
with gzip.open('wet.gz', 'rb') as fh:
    data = fh.read()

# using zlib (MAX_WBITS|16 tells zlib to expect a gzip header and trailer)
o = zlib.decompressobj(zlib.MAX_WBITS|16)
with open('wet.gz', 'rb') as f:
    result = [o.decompress(f.read()), o.flush()]
Both of them return this:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371
Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for August 2022
publisher: Common Crawl
So apparently both can decompress the first few paragraphs just fine, but everything after them is lost. Is this a bug in Python's zlib/gzip library?
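For anyone who wants to reproduce this without the 100MB file, here is a minimal sketch. It assumes the file is made of several gzip members concatenated back to back (an assumption about how the file was written, not something confirmed above), and the sample bytes are made up for illustration:

```python
import gzip
import zlib

# Two independent gzip members concatenated back to back (hypothetical sample data).
blob = gzip.compress(b"first record\n") + gzip.compress(b"second record\n")

# Same decompressor setup as in the question.
o = zlib.decompressobj(zlib.MAX_WBITS | 16)
first = o.decompress(blob) + o.flush()

print(first)               # data from the first member only
print(o.eof)               # whether the decompressor reached the end of a member
print(len(o.unused_data))  # bytes left over after that member
```

If the real file behaves the same way, `o.unused_data` holds everything after the first member, which would explain why only the first record comes back from the zlib version.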
Edit for future readers: I've integrated the accepted answer into my Python package, in case you don't want to mess around:
pip install k1lib
from k1lib.imports import *
lines = cat("wet.gz", text=False, chunks=True) | unzip(text=True)
for line in lines:
    print(line)
This reads the file in binary mode chunk by chunk, unzips the chunks incrementally, splits the result into lines, and converts them to strings.
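If you'd rather stay with the standard library, the same idea can be sketched roughly as follows. This is not the k1lib implementation; `iter_gzip_lines`, `chunk_size`, and `encoding` are hypothetical names chosen for illustration, and the sketch assumes the file is a sequence of concatenated gzip members containing UTF-8 text:

```python
import zlib


def iter_gzip_lines(path, chunk_size=1 << 20, encoding="utf-8"):
    """Yield decoded lines from a (possibly multi-member) gzip file."""
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""  # decompressed bytes not yet terminated by a newline
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)  # read the file in binary chunks
            if not chunk:
                break
            while chunk:
                pending += decomp.decompress(chunk)
                if decomp.eof:
                    # One gzip member ended inside this chunk: start a fresh
                    # decompressor on whatever bytes were left over.
                    chunk = decomp.unused_data
                    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
                else:
                    chunk = b""
            # Emit every complete line; keep the unfinished tail for later.
            *lines, pending = pending.split(b"\n")
            for line in lines:
                yield line.decode(encoding)
    pending += decomp.flush()
    if pending:
        yield pending.decode(encoding)
```

Calling `iter_gzip_lines("wet.gz")` in a `for` loop would then yield one string per line while holding only about one chunk of data in memory at a time.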