How to convert bytes type to str or json?

Question

How do I convert this bytes value to str or JSON? I have this Python bytes object and need to convert it to JSON or string format. How can I do that?

b'x\xda\x04\xc0\xb1\r\xc4 \x0c\x85\xe1]\xfe\x9a\x06\xae\xf36\'B\x11\xc9J$?\xbbB\xec\x9eo\xb3"\xde\xc0\x9ero\xc4Ryb\x1b\xe5?K\x18\xaa9\x97\xc4i\xdc\x17\xd6\xc7\xaf\x8f\xf3\x05\x00\x00\xff\xff l\x12l'

The duplicate assumes the data is UTF-8-encoded text. It is not. — Mark Tolonen, Jul 16 '22 at 18:28

Mark Tolonen · Answer 1 · 2022-07-16T18:33:00.010

This looks like random binary data, not encoded text. One way to store binary data in JSON is base64 encoding, which maps the data to printable ASCII characters. The result of b64encode() is still a bytes object, so .decode('ascii') is used to convert those ASCII bytes to a Unicode str, which can then be placed in an object that is serialized to JSON.

Example:

import base64
import json

data = b'x\xda\x04\xc0\xb1\r\xc4 \x0c\x85\xe1]\xfe\x9a\x06\xae\xf36\'B\x11\xc9J$?\xbbB\xec\x9eo\xb3"\xde\xc0\x9ero\xc4Ryb\x1b\xe5?K\x18\xaa9\x97\xc4i\xdc\x17\xd6\xc7\xaf\x8f\xf3\x05\x00\x00\xff\xff l\x12l'

j = {'data':base64.b64encode(data).decode('ascii')}
s = json.dumps(j)
print(s) # resulting JSON text

# restore back to binary data
j2 = json.loads(s)
data2 = base64.b64decode(j2['data'])
print(data2 == data)

Output:

{"data": "eNoEwLENxCAMheFd/poGrvM2J0IRyUokP7tC7J5vsyLewJ5yb8RSeWIb5T9LGKo5l8Rp3BfWx6+P8wUAAP//IGwSbA=="}
True

A simpler option, at the cost of a longer result, is to use data.hex() to get a hexadecimal string representation and bytes.fromhex() to convert that back to bytes:

>>> s = data.hex()
>>> s
'78da04c0b10dc4200c85e15dfe9a06aef336274211c94a243fbb42ec9e6fb322dec09e726fc45279621be53f4b18aa3997c469dc17d6c7af8ff3050000ffff206c126c'
>>> data2 = bytes.fromhex(s)
>>> data2
b'x\xda\x04\xc0\xb1\r\xc4 \x0c\x85\xe1]\xfe\x9a\x06\xae\xf36\'B\x11\xc9J$?\xbbB\xec\x9eo\xb3"\xde\xc0\x9ero\xc4Ryb\x1b\xe5?K\x18\xaa9\x97\xc4i\xdc\x17\xd6\xc7\xaf\x8f\xf3\x05\x00\x00\xff\xff l\x12l'
>>> data2 == data
True
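
If bytes values can appear anywhere in a larger structure, the base64 approach above could be wrapped in a custom encoder so that json.dumps() handles them automatically. This is only a sketch: the names BytesEncoder and decode_bytes and the "__bytes_b64__" tag are hypothetical and not part of the standard library; only json.JSONEncoder.default and the object_hook parameter of json.loads() are real APIs.

```python
import base64
import json


class BytesEncoder(json.JSONEncoder):
    """Hypothetical encoder: wraps bytes as a tagged base64 string."""

    def default(self, o):
        if isinstance(o, (bytes, bytearray)):
            # The tag name is an arbitrary choice for this sketch.
            return {"__bytes_b64__": base64.b64encode(o).decode("ascii")}
        return super().default(o)


def decode_bytes(d):
    """Hypothetical object_hook: turns tagged dicts back into bytes."""
    if set(d) == {"__bytes_b64__"}:
        return base64.b64decode(d["__bytes_b64__"])
    return d


payload = {"name": "sample", "blob": b"x\xda\x04\xc0\x00\xff"}
text = json.dumps(payload, cls=BytesEncoder)
restored = json.loads(text, object_hook=decode_bytes)
print(restored == payload)
```

The tag keeps binary fields distinguishable from ordinary strings when loading, at the cost of a slightly larger document.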

Dennis · Answer 2 · answered Jul 16 '22 at 18:08

Score: 1

Use the decode() method of the bytes object and pass the encoding that was used as an argument.

This assumes the bytes are encoded text, but it appears not. – Mark Tolonen Jul 16 '22 at 18:21
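
To illustrate the comment above, here is a small sketch of when decode() works and when it does not. The text sample is made up for illustration; the binary sample is just the first five bytes of the data from the question.

```python
# decode() works when the bytes really are encoded text.
text_bytes = "caf\u00e9".encode("utf-8")
decoded_text = text_bytes.decode("utf-8")
print(decoded_text)

# The first bytes of the question's data are not valid UTF-8.
binary_prefix = b"x\xda\x04\xc0\xb1"
try:
    binary_prefix.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f"not UTF-8 text: {exc.reason}")
```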

martineau · Answer 3 · 2022-07-16T20:52:33.333

0

You don't have to convert the binary data with the base64 encoding algorithm or into a hexadecimal string, as @Mark Tolonen suggests in his answer. Both require substantially more space to represent the data than the original.

Instead you can take advantage of the fact that JSON strings are "a sequence of zero or more Unicode characters" (per the JSON spec). Because latin1 maps every byte value 0x00-0xFF to the Unicode code point with the same number, you can "decode" the binary data as latin1 and then "encode" it back to the original binary data without loss.

Here's what I mean:

import json

data = b'x\xda\x04\xc0\xb1\r\xc4 \x0c\x85\xe1]\xfe\x9a\x06\xae\xf36\'B\x11\xc9J$?\xbbB\xec\x9eo\xb3"\xde\xc0\x9ero\xc4Ryb\x1b\xe5?K\x18\xaa9\x97\xc4i\xdc\x17\xd6\xc7\xaf\x8f\xf3\x05\x00\x00\xff\xff l\x12l'

j = {'data': data.decode('latin1')}
s = json.dumps(j)
print(s) # resulting JSON text

# restore back to binary data
j2 = json.loads(s)
data2 = j2['data'].encode('latin1')
assert data2 == data  # Should be identical.
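
As a quick check of why this round trip is lossless, the sketch below runs all 256 possible byte values through latin1 and through JSON. The variable names are only for illustration.

```python
import json

all_bytes = bytes(range(256))

# latin1 maps byte value N to code point N, so nothing is lost or rejected.
as_text = all_bytes.decode("latin1")
code_points = [ord(ch) for ch in as_text]

# Round trip through JSON text and back to bytes.
s = json.dumps({"data": as_text})
round_tripped = json.loads(s)["data"].encode("latin1")
print(round_tripped == all_bytes)
```

Note that encode('latin1') raises UnicodeEncodeError if the string contains a code point above 0xFF, so this only works on strings produced this way.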

Here's the difference it makes for your sample data:

import base64

print(f"{len(data)}")                                    # -> 67
print(f"{len(data.decode('latin1'))}")                   # -> 67 
print(f"{len(base64.b64encode(data).decode('ascii'))}")  # -> 92 
print(f"{len(data.hex())}")                              # -> 134

✶ Note that I learned about the encoding trick from an answer by @Sven Marnach to a question about serializing binary data long ago (and have used it multiple times since).

edited Jul 16 '22 at 20:52

answered Jul 16 '22 at 20:00


Look at the data once you write it to JSON, though. Even with `ensure_ascii=False`, bytes like 0x00 become `'\\u0000'`. If there are lots of control bytes 0x00-0x1f in the data, it still gets rather large. And if written with the standard UTF-8 encoding, the code points above 0x7F double in size as well. Base64 is a rather consistent 33% bigger. – Mark Tolonen Jul 16 '22 at 20:35
Case in point: `len(json.dumps(bytes(range(0x20)).decode('latin1'),ensure_ascii=False))`. 32 bytes becomes 174 code points. – Mark Tolonen Jul 16 '22 at 20:51
@Mark: Yes, your mileage will vary depending on the data involved. – martineau Jul 16 '22 at 20:53
Even with the sample data, dump it to JSON: `len(json.dumps(data.decode('latin1'),ensure_ascii=False))` -> 122 code points. – Mark Tolonen Jul 16 '22 at 20:57
@Mark: That last example was an apples to oranges comparison. Your point about it depending on the data has been made. – martineau Jul 16 '22 at 21:00
oranges to oranges then, just to complete the point: `len(json.dumps(base64.b64encode(data).decode('ascii'),ensure_ascii=False))` -> 94 – Mark Tolonen Jul 16 '22 at 21:04
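
The comparison in the comments above can be reproduced with a sketch like the following, which measures each representation after it has been serialized to JSON and encoded as UTF-8 (the usual encoding for JSON files; that choice is an assumption here). The dictionary names are only for illustration.

```python
import base64
import json

data = b'x\xda\x04\xc0\xb1\r\xc4 \x0c\x85\xe1]\xfe\x9a\x06\xae\xf36\'B\x11\xc9J$?\xbbB\xec\x9eo\xb3"\xde\xc0\x9ero\xc4Ryb\x1b\xe5?K\x18\xaa9\x97\xc4i\xdc\x17\xd6\xc7\xaf\x8f\xf3\x05\x00\x00\xff\xff l\x12l'

candidates = {
    "latin1": data.decode("latin1"),
    "base64": base64.b64encode(data).decode("ascii"),
    "hex": data.hex(),
}

sizes = {}
for name, text in candidates.items():
    # Serialize the string alone, as in the comments above.
    serialized = json.dumps(text, ensure_ascii=False)
    # Size on disk once the JSON text is written as UTF-8.
    sizes[name] = len(serialized.encode("utf-8"))
    print(f"{name:7} {len(serialized):4} code points, {sizes[name]:4} UTF-8 bytes")
```

For this sample the latin1 string ends up larger than the base64 one once escapes and UTF-8 encoding are counted; other data may behave differently.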
@Mark: For the record, note that the `base64` module also supports Adobe [Ascii85](https://en.wikipedia.org/wiki/Ascii85) (aka Base85) encoding via `a85encode()`, which is better than `b64encode()` in the sense of having a smaller percentage increase. – martineau Jul 16 '22 at 21:13
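
A minimal sketch of the Base85 option mentioned in the comment above, using an arbitrary sample of all 256 byte values. Besides a85encode(), the base64 module also has b85encode(), whose alphabet contains neither the double quote nor the backslash, so its output needs no escaping inside a JSON string; the Ascii85 alphabet includes both characters, so its JSON form can be slightly longer than the encoded text.

```python
import base64
import json

data = bytes(range(256))  # arbitrary sample data for illustration

b64 = base64.b64encode(data).decode("ascii")
a85 = base64.a85encode(data).decode("ascii")
b85 = base64.b85encode(data).decode("ascii")

for name, text in (("base64", b64), ("a85", a85), ("b85", b85)):
    print(f"{name:6} {len(text):4} chars, {len(json.dumps(text)):4} as JSON")

# Round trip through JSON and back to bytes.
s = json.dumps({"data": b85})
restored = base64.b85decode(json.loads(s)["data"])
print(restored == data)
```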
