Encoding and decoding binary data for inclusion into JSON with Python 3

Question

I need to decide on a schema for including binary elements in a message object so that it can be decoded again on the receiving end (in my situation, a consumer on a RabbitMQ / AMQP queue).

I decided against multipart MIME encoding over JSON mostly because it seems like using Thor's hammer to push in a thumb tack. I decided against manually joining parts (binary and JSON concatenated together) mostly because every time a new requirement arises it is a whole re-design. JSON with the binary encoded in one of the fields seems like an elegant solution.

My solution, which seems to work (confirmed by comparing the MD5 sums of the sent and received data), does the following:

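import base64
import json


def json_serialiser(byte_obj):
    if isinstance(byte_obj, (bytes, bytearray)):
        # File bytes -> Base64 bytes -> str (Base64 output is pure ASCII)
        return base64.b64encode(byte_obj).decode('utf-8')
    # type() returns a class, so use its name; str + type would itself raise
    raise TypeError('No encoding handler for data type ' + type(byte_obj).__name__)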


def make_msg(filename, filedata):
    d = {"filename": filename,
         "datalen": len(filedata),
         "data": filedata}
    return json.dumps(d, default=json_serialiser)

On the receiving end I simply do:

import base64
import json


def parse_json(msg):
    d = json.loads(msg)
    # Remove the Base64 string from the dict and decode it back to bytes
    data = d.pop('data')
    return base64.b64decode(data), d


def file_callback(ch, method, properties, body):
    filedata, fileinfo = parse_json(body)
    print('File Name:', fileinfo.get("filename"))
    print('Received File Size', len(filedata))

My Google-fu left me unable to confirm whether what I am doing is in fact valid. In particular, I am concerned about whether the line that produces the string from the binary data for inclusion in the JSON is correct, i.e. the line return base64.b64encode(byte_obj).decode('utf-8')

It also seems that I can take a shortcut when decoding back to binary data, because base64.b64decode() handles the UTF-8 string as if it were ASCII, which is what one would expect of the output of base64.b64encode(). But is this a valid assumption in all cases?

Mostly I'm surprised at not being able to find any examples of this online. Perhaps my Google patience is still on holiday!

snakecharmerb · Accepted Answer · 2020-08-22T13:55:42.383

The docs confirm that your approach is ok.

base64.b64encode(byte_obj).decode('utf-8') is correct - base64.b64encode requires bytes as input:

Encode the bytes-like object s using Base64 and return the encoded bytes.

However, base64.b64decode accepts either bytes or an ASCII string:

Decode the Base64 encoded bytes-like object or ASCII string s and return the decoded bytes.
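
Both quoted behaviours can be checked directly. This is a small sketch with a made-up sample payload: it shows `b64decode` giving the same result for bytes and str input, rejecting a str that contains non-ASCII characters, and `b64encode` refusing str input.

```python
import base64

sample = b"\x00\xffbinary\x10"               # arbitrary sample payload
encoded_bytes = base64.b64encode(sample)     # bytes
encoded_str = encoded_bytes.decode("ascii")  # str, safe to place in JSON

# b64decode accepts the Base64 text as bytes or as an ASCII-only str.
assert base64.b64decode(encoded_bytes) == sample
assert base64.b64decode(encoded_str) == sample

# A str containing non-ASCII characters is rejected with ValueError.
try:
    base64.b64decode("caf\u00e9")
except ValueError as exc:
    non_ascii_error = exc
else:
    non_ascii_error = None

# b64encode only takes bytes-like input; a str raises TypeError.
try:
    base64.b64encode("not bytes")
except TypeError as exc:
    str_input_error = exc
else:
    str_input_error = None
```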

edited Aug 22 '20 at 13:55

answered Dec 27 '18 at 10:47

snakecharmerb


Thank you. But when I use e.g. `...decode('latin-1')` I still get the same result. So, assuming everything else I do is correct, the question that remains is whether `decode('utf-8')` is the correct approach for serialising the base64-encoded bytes to `str`. – The Tahaan Dec 27 '18 at 16:24

It doesn't matter whether you use `utf-8` or `latin-1`, because both encodings map ASCII characters to the same byte values, and base64 output contains only ASCII characters. So either is fine. – snakecharmerb Dec 27 '18 at 16:28
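
A short sketch of that point, assuming random sample bytes: the Base64 output decodes to the same string under `ascii`, `utf-8` and `latin-1`, and every character falls inside the standard Base64 alphabet.

```python
import base64
import os
import string

raw = base64.b64encode(os.urandom(256))  # Base64 of arbitrary binary data

as_ascii = raw.decode("ascii")
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

# All three codecs agree because every byte is below 0x80.
assert as_ascii == as_utf8 == as_latin1

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/', plus '=' for padding.
alphabet = set(string.ascii_letters + string.digits + "+/=")
assert set(as_ascii) <= alphabet
```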
