
I'm writing a tool to interact with a popular data warehouse SaaS. Their online SQL editor serializes SQL worksheets to JSON, but the body of the SQL worksheet is zlib-deflated using pako.js. I'm trying to read and inflate these zlib strings from Python, but I can only decompress bytestrings that contain short, simple SQL text.

An example where the SQL text was the letter a:

bytestring = b'x\xef\xbf\xbdK\x04\x00\x00b\x00b\n'
zlib.decompress(bytestring[4:-4], -15).decode('utf-8')
>>> "a"

If I include a semicolon (a;), this fails to decompress:

bytestring = b'x\xef\xbf\xbdK\xef\xbf\xbd\x06\x00\x00\xef\xbf\xbd\x00\xef\xbf\xbd\n'
zlib.decompress(bytestring[4:-4], -15).decode('utf-8')
*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 1: invalid start byte

Note: I've also tried decoding these examples with 'punycode', which I found references to in the JavaScript implementation.

My understanding of zlib is pretty limited, but I've picked up that the first two and last four bytes of a zlib string are a header/footer that can be trimmed if we run zlib with the magic number -15. It's entirely possible there is a zlib magic number that would decompress these strings without needing to strip the header and footer, but I wasn't able to get any combination to work when looping from -64 to 64.
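
For reference, the -15 here is zlib's wbits parameter rather than a magic number: negative values tell zlib to expect raw deflate data with no header or checksum. A minimal sketch with an intact, uncorrupted stream, independent of pako:

# wbits_demo.py -- sketch: two equivalent ways to decompress an intact zlib stream
import zlib

data = zlib.compress(b"a;")              # 2-byte zlib header + raw deflate data + 4-byte Adler-32 trailer
print(zlib.decompress(data))             # default wbits=15: zlib parses the header/trailer itself
print(zlib.decompress(data[2:-4], -15))  # wbits=-15: raw deflate only, so header/trailer must be stripped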

I've breakpointed my way through the online SQL worksheet editor's save and load functions and found they are using the pako zlib library: pako.deflate(a, {to: 'string'}) and pako.inflate(b['body'], {to: 'string'}). I'm able to inflate/deflate SQL strings in the browser using the pako library, but I haven't been able to reproduce the same results in Python.
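
For what it's worth, when the bytes arrive intact, zlib can stand in for pako directly; a rough sketch of the Python equivalents of those two pako calls (not the vendor's actual code) is:

# pako_equivalents.py -- sketch, assuming the worksheet body reaches Python uncorrupted
import zlib

body = zlib.compress("select 1;".encode("utf-8"))   # roughly pako.deflate(a, {to: 'string'})
sql = zlib.decompress(body).decode("utf-8")         # roughly pako.inflate(b['body'], {to: 'string'})
assert sql == "select 1;"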

Sethish

2 Answers


I agree that this is a data corruption issue. zlib and pako should be able to read one another's data without stripping any fields off or adding magic numbers.

To prove it, here are a couple of demo scripts I threw together, one using pako to deflate the data and one using zlib to inflate it again:

// deflate.js
var pako = require("./pako.js");
console.log(escape(pako.deflate(process.argv[2], {to: "string"})));
# inflate.py
import urllib.parse, zlib, sys
print(zlib.decompress(urllib.parse.unquote_to_bytes(sys.stdin.read())).decode("utf-8"))

Run them on the command line using node deflate.js "Here is some example text" | python inflate.py. The expected output is the argument passed to node deflate.js.

One thing that is worth pointing out about pako is the behaviour when using the to: "string" option. The documentation for this option is as follows:

to (String) - if equal to 'string', then result will be "binary string" (each char code [0..255])

It is for this reason that I use escape in the JavaScript function above. Using escape ensures that the string passed between JavaScript and Python doesn't contain any non-ASCII characters. (Note that encodeURIComponent does not work because the string contains binary data.) I then use urllib.parse.unquote_to_bytes in Python to undo this escaping.

If you can escape the pako-deflated data in the browser you could potentially pass that to Python to inflate it again.
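
One way to sanity-check the whole round trip from a Python prompt: with pako's default settings (which match Python's zlib defaults), escape(pako.deflate("a", {to: "string"})) should come out as something like x%9CK%04%00%00b%00b, and that unescapes and decompresses cleanly:

# sanity_check.py -- assumes the escaped string below came from pako with default options
import urllib.parse, zlib

escaped = "x%9CK%04%00%00b%00b"
raw = urllib.parse.unquote_to_bytes(escaped)    # b'x\x9cK\x04\x00\x00b\x00b'
print(zlib.decompress(raw).decode("utf-8"))     # prints: a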

Luke Woodward
  • Alas, I have no control over the JavaScript side of the process. – Sethish Apr 26 '20 at 15:12
  • @Sethish: you don't explain exactly how you are getting the serialized and/or compressed data from wherever it is stored into your Python script, but it's clear that the data is being corrupted as part of this process. I would hope my answer provides some indication that if you can fix this data-corruption issue then you should be able to read the compressed data. – Luke Woodward Apr 26 '20 at 16:18
  • While doing the decompress in Python, I got 'zlib.error: Error -3 while decompressing data: incorrect header check' with the example. What version of pako and Python did you use? – BluePie Mar 26 '21 at 02:03
  • @BluePie: pako 1.0.11, Python 3.7.3. – Luke Woodward Mar 26 '21 at 19:27
  • @LukeWoodward thanks, pako 2.x was the problem: https://github.com/nodeca/pako/issues/206 – BluePie Mar 27 '21 at 11:26

Each sequence of \xef\xbf\xbd represents an instance of corruption of the original data.

In your first example, the first and only \xef\xbf\xbd should be a single byte, the second byte of the zlib header. In the second example, the first \xef\xbf\xbd should again be the second byte of the zlib header, the second instance should be \xb4, the third should be \xff, and the fourth should be \x9d.
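
You can check what those bytes should have been by compressing the same text yourself; assuming default compression settings, the intact stream for a; looks like this:

# expected_stream.py -- the uncorrupted zlib stream for "a;" (default compression level assumed)
import zlib
print(zlib.compress(b"a;"))   # b'x\x9cK\xb4\x06\x00\x00\xff\x00\x9d'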

Somewhere along the way there is some UTF-8 processing that should not be happening. It fails every time it comes across a byte with the high bit set, and replaces that byte with the three-byte UTF-8 encoding of U+FFFD, the "replacement character" used to represent an unknown character.
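
One way to reproduce the same damage: decode the compressed bytes as UTF-8 with replacement and re-encode. This yields exactly the first bytestring in the question, minus the trailing newline:

# reproduce_corruption.py -- sketch of how UTF-8 "replace" decoding mangles compressed data
import zlib

good = zlib.compress(b"a")                                  # b'x\x9cK\x04\x00\x00b\x00b'
bad = good.decode("utf-8", errors="replace").encode("utf-8")
print(bad)                                                  # b'x\xef\xbf\xbdK\x04\x00\x00b\x00b'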

The bottom line is that your data is irretrievably corrupted. You need to fix whatever is going on upstream from there. Are you trying to use copy and paste to get the data? If you see a question mark in a black diamond, that is the U+FFFD replacement character.

Mark Adler