1

I am writing compressed data as a bytes type to a black-box API (i.e. I cannot change what happens under the hood). When I get that data back, it is returned as a string type, which I cannot decompress using the standard Python modules (zlib, bz2, etc.).

In more detail, part of the problem is that this string includes the leading 'b', e.g.
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
(this is a string type).

When I compare this to the original binary representation, it is identical apart from the quotes and the leading b.

If I try to simply convert back to bytes (e.g. using the bytes function) it wraps the whole thing and escapes the slashes and I get something like the following:

b"b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'"
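
For context, the returned string looks like what Python produces when repr() (or str()) is called on a bytes object. Below is a minimal sketch of how that form arises and why re-encoding it does not help. It assumes the black box stringifies the bytes this way, which the question does not confirm; `original`, `returned` and `rewrapped` are illustrative names.

```python
import zlib

# Illustrative payload; the real data would come back from the black-box API.
original = zlib.compress(b'{"pad-info": "foobar"}')

# Assumption: the API effectively takes the repr of the bytes it receives,
# which yields text that includes the leading b and the surrounding quotes.
returned = repr(original)

# Encoding that text does not undo the stringification: every character of
# the repr, including each backslash, becomes a separate byte.
rewrapped = bytes(returned, "utf-8")

print(type(returned).__name__)        # str
print(rewrapped == original)          # False
print(len(original), len(rewrapped))  # the re-wrapped value is longer
```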

The question is: is it possible to convert this back to a bytes type so I can decompress it? If so, how?

I've seen a few different examples (e.g. How to cast a string to bytes without encoding) that don't quite work out for what I'm trying.

UPDATE:

Lots of good answers, thanks folks! I wish I could click accept on multiple of them. And yes, as many of you noted, it is zlib compressed. This is by design as we have extremely limited space to work with and would like to stay with JSON if possible (zlib was chosen arbitrarily to just get the quirks of binary data out, and may not be the final choice).

I.F. Adams

3 Answers

2

Assuming your original string is of type str, you have the following raw string (it contains literal four-character escape sequences such as \x9c, not the single bytes those sequences represent):

s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"

If you remove the leading b' and the trailing ', you can use the latin1 encoding to convert to bytes. latin1 is a 1:1 mapping of Unicode code points to byte values, because the first 256 Unicode code points represent the latin1 character set:

>>> b = s[2:-1].encode('latin1')
>>> b
b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'

This is now a byte string, but it contains literal escape codes. Now decode it with the unicode_escape codec to translate the escape codes back into a str of the actual code points:

>>> s2 = b.decode('unicode_escape')
>>> s2
'x\x9c«V*HLÑÍÌKËW²RPJËÏOJ,Rª\x05\x00T\x83\x07b'

This is now a Unicode string, with code points, but we still need a byte string. Encode with latin1 again:

>>> b2 = s2.encode('latin1')
>>> b2
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'

In one step:

>>> s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
>>> b = s[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')
>>> b
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'

It appears this sample data is a zlib-compressed JSON string:

>>> import zlib,json
>>> json.loads(zlib.decompress(b))
{'pad-info': 'foobar'}
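
The steps above can be wrapped in a small helper. This is only a sketch: `unstringify_bytes` is a hypothetical name, and the quote check is an added assumption about what the API returns (text shaped like a bytes repr), not something the question guarantees.

```python
import json
import zlib

def unstringify_bytes(text: str) -> bytes:
    """Convert the text of a bytes repr back into the bytes it describes."""
    # Assumption: the text starts with b plus a quote and ends with the same quote.
    if len(text) < 3 or text[0] != "b" or text[1] not in "'\"" or text[-1] != text[1]:
        raise ValueError("expected text that looks like a bytes repr")
    inner = text[2:-1]
    # latin1 -> unicode_escape -> latin1, as in the one-step version above.
    return inner.encode("latin1").decode("unicode_escape").encode("latin1")

s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
payload = json.loads(zlib.decompress(unstringify_bytes(s)))
print(payload)
```

Raising on unexpected input keeps a malformed response from silently turning into garbage bytes.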
Mark Tolonen
  • Isn't your initial creation of the string he is mentioning incorrect? When you write `s = "\xac"`, you are actually putting the 172nd character in a string, not a string containing `\xac`. – Tim Nov 21 '20 at 02:50
  • @varlogtim The OP wasn't clear, hence "assume", but I see your point looking at the OP's `bytes` output and have updated my answer to use a raw string. – Mark Tolonen Nov 21 '20 at 03:48
  • This is a clever answer. I never really thought about using latin-1, but you make a good point: it will always convert str -> bytes and bytes -> str because each character is only one byte wide. – Tim Nov 21 '20 at 19:47
2

What is happening is this:

The black-box server is stringifying the bytes before it sends them. You need to take the string that represents the bytes and turn it back into bytes. The easiest way to do this is with the Abstract Syntax Tree library (ast).

import ast
import zlib

stringified_bytes = "b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'"
print(f"{type(stringified_bytes)}: {stringified_bytes}")

actual_bytes = ast.literal_eval(stringified_bytes)
print(f"{type(actual_bytes)}: {actual_bytes}")

answer = zlib.decompress(actual_bytes)
print(f"Answer: {answer}")

Here is a run of the script:

(venv) [ttucker@zim stackoverflow]$ python bin.py 
<class 'str'>: b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
<class 'bytes'>: b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
Answer: b'{"pad-info": "foobar"}'
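
Because ast.literal_eval accepts any Python literal, not only bytes, it may be worth checking the type of the result. A minimal sketch, where `parse_bytes_literal` is a hypothetical helper name and the sample string is generated locally rather than taken from the API:

```python
import ast
import zlib

def parse_bytes_literal(text: str) -> bytes:
    """Evaluate text as a Python literal and insist that it is a bytes object."""
    value = ast.literal_eval(text)
    if not isinstance(value, bytes):
        raise TypeError(f"expected a bytes literal, got {type(value).__name__}")
    return value

# Stand-in for the API response: the repr of some zlib-compressed JSON.
stringified = repr(zlib.compress(b'{"pad-info": "foobar"}'))

print(zlib.decompress(parse_bytes_literal(stringified)))
```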

... this is pretty interesting stuff ... it looks like they have another byte-string with JSON in it. Is this like one of those hacker encoding challenges?

You have a zlib file, by the way

I know this because the first two bytes of the data are 78 9c (x = 78 in hex) ... and if you look that up here: https://en.wikipedia.org/wiki/List_of_file_signatures, you can see it is zlib.

So, I used the zlib library to decode it ... Neat stuff.
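
The header check can also be done in code. The sketch below follows the two-byte header layout from RFC 1950 (the low four bits of the first byte are the compression method, and the two bytes read as a big-endian number are a multiple of 31). `looks_like_zlib` is a hypothetical helper, and it is only a heuristic: other data can pass it by coincidence.

```python
import zlib

def looks_like_zlib(data: bytes) -> bool:
    """Heuristic check of the two-byte zlib stream header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    uses_deflate = (cmf & 0x0F) == 8            # compression method 8 = deflate
    header_check = (cmf * 256 + flg) % 31 == 0  # check bits make this a multiple of 31
    return uses_deflate and header_check

print(looks_like_zlib(b"x\x9c"))               # True  (0x78 0x9c)
print(looks_like_zlib(zlib.compress(b"abc")))  # True
print(looks_like_zlib(b'{"pad-info"'))         # False (plain JSON text)
```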

Tim
  • FYI, using the AST parser is slower. Probably doesn't matter for such a small string, but depends on the OP's use case. The encode/decode/encode approach I used seems like it might be slower with all the conversions, but timeit came out 4.75x faster than ast.literal_eval (1.68us vs. 7.98us using the OP's data). – Mark Tolonen Nov 21 '20 at 04:03
  • It's not an encoding challenge, actually a part of something I'm dealing with in my day job, where I'm trying to store a compressed JSON cause we have very limited space to work with. Good catch on the zlib ;) – I.F. Adams Dec 01 '20 at 21:51
  • Ahh yeah, interesting stuff. It has been a while since I have thought through this but if you want to store compressed files in JSON, I believe the best thing is data -> compression -> base64 -> JSON. Using 4 characters per byte like you are currently doing isn't the greatest... With base64, as your string gets longer, the encoding should be smaller than just storing the hex chars (which is essentially base16 encoding). – Tim Dec 01 '20 at 23:06
  • Good call, we'll keep that in mind. Thanks! – I.F. Adams Dec 01 '20 at 23:55
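
The data -> compression -> base64 -> JSON pipeline suggested in the comments could look roughly like this. It is a sketch under assumptions: the "payload" field name and the `record` contents are made up for illustration, and zlib stands in for whichever compressor is finally chosen.

```python
import base64
import json
import zlib

record = {"pad-info": "foobar"}

# data -> compression -> base64 -> JSON
compressed = zlib.compress(json.dumps(record).encode("utf-8"))
encoded = base64.b64encode(compressed).decode("ascii")
envelope = json.dumps({"payload": encoded})  # "payload" is a hypothetical field name

# JSON -> base64 -> decompression -> data
received = json.loads(envelope)["payload"]
restored = json.loads(zlib.decompress(base64.b64decode(received)))

print(restored == record)
# Compare the base64 text length with the length of the stringified bytes.
print(len(encoded), len(repr(compressed)))
```

Base64 text is plain ASCII with no quotes or backslashes, so it survives a JSON round trip without any escaping and can be decoded with a single call.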
1

You can take the bytes from your string by selecting the whole string except the first two characters (b') and the last character ('). Then convert it to bytes first and decode it back to a string.

Here is an example:

str(bytes(bytes_string[2:-1], encoding), encoding)

Where:

bytes_string = "b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"

and encoding is the encoding used in the bytes string (e.g. 'UTF-8').