How to read a Gzip String with no header or mimetype? Using Python

Question

I have a gzipped string that was created and stored by another application. Now that I have the string (with no mimetype or headers attached), I need to uncompress it.

Is there a way to do this in Python?

[EDIT] To test, I literally copied and pasted the string into Notepad and renamed the file to .gz. I've also tested by pasting the string itself into IDLE.

Other examples I've seen assume a filetype and mimetype are available and all I have is a big string.

Using zlib.decompress(mystring) gives the error: Error -3 while decompressing data: incorrect header check

Notepad can break your string. Can you upload the original file somewhere? — reclosedev, Feb 04 '12 at 18:49
Here's the link: https://www.yousendit.com/download/T2djWGJORkVmVGJsZThUQw it's a csv file with two records. Each record is a string. The array in front of the gzip string can be ignored (it's unrelated dimension data) — , Feb 04 '12 at 19:03
Are you aware that your strings are encoded in Base64 or something like it? What kind of data are these records (text, binary)? — reclosedev, Feb 04 '12 at 19:17
Thank you for looking, sounds like you're onto the solution! The compressed data is a series of 1s and 0s in plain text. The developers said it was gzipped (you're likely right that it's not) and they gave me the impression that there may be line breaks between each row of 1s and 0s — , Feb 04 '12 at 19:22
If you decode it from Base64 with `s = s.decode('base64')`, then skip 4 bytes and decompress the rest with the [special `wbits` parameter](http://stackoverflow.com/questions/1838699/how-can-i-decompress-a-gzip-stream-with-zlib), `s = zlib.decompress(''.join(s[4:]), 16 + zlib.MAX_WBITS)`, you'll get binary data. — reclosedev, Feb 04 '12 at 19:52

John Machin · Accepted Answer · 2012-02-05T20:26:34.170

Confirming the comments by @reclosedev, and adding some more:

The bytes after the ] need to be base64-decoded.

In the result of that, there are 4 bytes constituting the length of the decompressed data as a 32-bit little-endian binary number. The remainder is an RFC-1952-compliant gzip stream, recognisable by starting with 1F 8B 08. The decompression results look like binary data, not strings of ASCII 1s and 0s.

Code:

# Python 2
import zlib, struct

lines = [
    # extracted from the linked csv file
    "[133,120,696,286]MmEAAB+LCAAAAAAABADtvQdg [BIG snip] a0bokyYQAA",
    "[73,65,564,263]bkgAAB+LCAAAAAAABADtvQdgHE [BIG snip] kgAAA==",
    ]
for line in lines:
    print
    # the base64 text is everything after the ']'
    b64 = line.split(']')[1]
    raw = b64.decode('base64')
    # first 4 bytes: length of the decompressed data
    print "unknown:", repr(raw[:4])
    print "unknown as 32-bit LE int:", struct.unpack("<I", raw[:4])[0]
    # wbits=31 (16 + zlib.MAX_WBITS) makes zlib expect a gzip header
    ungz = zlib.decompress(raw[4:], 31)
    print len(ungz), "bytes in decompressed data"
    print "first 100:", repr(ungz[:100])

Output:

unknown: '2a\x00\x00'
unknown as 32-bit LE int: 24882
24882 bytes in decompressed data
first 100: '\xff\xe0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xf0\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00'

unknown: 'nH\x00\x00'
unknown as 32-bit LE int: 18542
18542 bytes in decompressed data
first 100: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\xff\xff\xff\xff
\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x07\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x80
\x00\x00\x00'
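
The code above is Python 2. A rough Python 3 sketch of the same steps (split off the bracketed prefix, base64-decode, read the 4-byte length, decompress the gzip stream) might look like the following. The function name `decode_record` and the synthetic sample record are assumptions made for illustration; the sample is built in the script and is not data from the linked file.

```python
import base64
import gzip
import struct
import zlib


def decode_record(line):
    """Return (declared_length, payload) for one '[dims]base64' record."""
    # Everything after the first ']' is base64 text; the bracketed prefix is ignored.
    b64 = line.split("]", 1)[1]
    raw = base64.b64decode(b64)
    # First 4 bytes: length of the decompressed data, 32-bit little-endian.
    (declared_length,) = struct.unpack("<I", raw[:4])
    # wbits = 16 + zlib.MAX_WBITS (31) tells zlib to expect a gzip header and trailer.
    payload = zlib.decompress(raw[4:], 16 + zlib.MAX_WBITS)
    return declared_length, payload


# Synthetic record, constructed here only to exercise the function.
original = b"\xff\xe0" + b"\x00" * 98
blob = struct.pack("<I", len(original)) + gzip.compress(original)
sample_line = "[1,2,3,4]" + base64.b64encode(blob).decode("ascii")

declared_length, payload = decode_record(sample_line)
print(declared_length, len(payload), payload == original)
```

Comparing `declared_length` with `len(payload)` is a cheap sanity check that the 4-byte prefix really is the decompressed length, as it was for both records above.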

Update in response to comment

To get the 1s and 0s I needed I just added this to the above
cleaned = bin(int(binascii.hexlify(ungz), 16))

"Just"? You would need to strip off '0b' from the front, and then pad the front with as many leading zeroes as necessary to make the length a multiple of 8. Example, with a better method:

>>> import binascii
>>> ungz = '\x01\x80'
>>> bin(int(binascii.hexlify(ungz), 16))
'0b110000000'
>>> ''.join('{0:08b}'.format(ord(x)) for x in ungz)
'0000000110000000'

Have you checked carefully to ensure that you really want '0000000110000000' and not '1000000000000001'?
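
For reference, a small Python 3 sketch of both bit orders. In Python 3, iterating over `bytes` yields integers, so `ord()` is not needed. The helper names `bits_msb_first` and `bits_lsb_first` are made up for this example, and which order is correct depends on how the producing application packed the bits.

```python
def bits_msb_first(data):
    """Each byte as 8 characters, most significant bit first."""
    return "".join(format(byte, "08b") for byte in data)


def bits_lsb_first(data):
    """Each byte as 8 characters, least significant bit first."""
    return "".join(format(byte, "08b")[::-1] for byte in data)


ungz = b"\x01\x80"
print(bits_msb_first(ungz))  # 0000000110000000
print(bits_lsb_first(ungz))  # 1000000000000001
```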

wow, thanks so much to @JohnMachin and @reclosedev. This actually works perfectly. To get the 1s and 0s I needed I just added this to the above `cleaned = bin(int(binascii.hexlify(ungz), 16))` — , Feb 05 '12 at 18:11
Thanks, you are correct that I'm missing some leading 0s. I was going to leave that until Monday, but your added response is much better than where I was headed. I ran a separate test example and needed to pad, but I did it manually to test that everything else was working correctly. You've answered my initial question much more thoroughly than I was hoping for, thanks again! — , Feb 05 '12 at 23:40
