How to cast a string to bytes without encoding

Question

I have a bunch of binary data that comes to Python via a char* from a C interface (not under my control), so I end up with a string of arbitrary binary data (what would normally be a byte array). I would like to convert it to a byte array to make it easier to use with other Python functions, but I can't figure out how.

Examples that don't work:

data = rawdatastr.encode() this assumes "utf-8" and mangles the data == BAD

data = rawdatastr.encode('ascii','ignore') strips chars over 127 == BAD

data = rawdatastr.encode('latin1') not sure -- this is the closest so far but I have no proof that it is working for all bytes.

data = array.array('B', [x for x in map(ord,rawdatastr)]).tobytes() This works but seems like a lot of work to do something simple. Is there something simpler?

I am thinking I need to write my own identity encoding that just passes the bytes along (I think latin1 does this based upon some reading but no proof thus far).

Is it a `str` or is it a `bytearray`? If it's a `str` it has been decoded in some way. If it's a bytearray it's already bytes-equivalent (you can make it actually the `bytes` type via `bytes(bytearray_variable)`) — anthony sottile, Mar 14 '17 at 19:41
It is a string, not a byte array. As far as I can tell it has not been decoded in any way. If you "print" it, it will print the bytes correctly: '\x00\x01' etc. — nickdmax, Mar 14 '17 at 19:48
It must be decoded in some way, `str` does not represent binary data. Either way, I've answered below. — anthony sottile, Mar 14 '17 at 19:51

anthony sottile · Answer 1 · 2020-07-01T00:20:39.550

I suspect something else is decoding your data for you (a char* in C is usually best represented as bytes, especially if it is binary data). That said:

The latin1 codec can round trip every byte. You can verify this with the following short program:

>>> s = ''.join(chr(i) for i in range(0x100))
>>> s
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
>>> s2 = s.encode('latin1').decode('latin1')
>>> s2 == s
True
>>> sb = bytes(range(0x100))
>>> sb
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> sb == s.encode('latin1')
True

Thank you! I was just working up a similar program to verify each byte. Yes I think that latin1 works as an identity encoding and I think this proves it. — nickdmax, Mar 14 '17 at 19:53
In the end, after trying some non-trivial examples, things are broken. I think you are right that some decoding is happening, because when I try latin1 I get the error: "'latin-1' codec can't encode characters in position 0-73: ordinal not in range(256)", which implies to me that the data has indeed been decoded to code points beyond 255. — nickdmax, Mar 14 '17 at 22:27
This answer was very useful, because everywhere else on the internet I could only find answers that mangle the data. +1 — David Callanan, Aug 30 '18 at 12:59
Great answer including test case! BTW: Someone else mentioned using `iso-8859-15` which is identical to `latin1`. — Tom Pohl, Dec 16 '19 at 14:52
range(0x100) is required for the full set. But having tested it, it works. — ijw, Jul 01 '20 at 00:07
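
The failure reported in the comments above can be reproduced with a short sketch. It uses only the standard library; the helper name to_raw_bytes is hypothetical and not part of the answer. It assumes the goal is a one-to-one mapping of code points 0..255 to byte values, and it fails loudly when the string holds anything larger, which is a sign that the data was decoded with something other than latin1.

```python
def to_raw_bytes(s):
    """Map each code point 0..255 to the byte with the same value.

    Hypothetical helper: raises ValueError if the string holds a code
    point above 255, because no single byte can represent it.
    """
    highest = max(map(ord, s), default=0)
    if highest > 0xFF:
        raise ValueError(f"code point U+{highest:04X} does not fit in one byte")
    return s.encode('latin-1')


print(to_raw_bytes('\x00\x01\xfe\xff'))   # b'\x00\x01\xfe\xff'

try:
    to_raw_bytes('\u20ac')                # EURO SIGN, code point 0x20AC
except ValueError as exc:
    print(exc)
```

If the check fails, the original bytes cannot be recovered without knowing which decoding the C interface layer applied.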

Roland Smith · Answer 2 · 2018-11-26T22:41:36.603

Just now I ran into the same problem. This is what I came up with:

import struct

def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num <= 0xFF:
            # Fits in one byte.
            outlist.append(struct.pack('B', num))
        elif num <= 0xFFFF:
            # Two bytes, big-endian.
            outlist.append(struct.pack('>H', num))
        else:
            # Three bytes, big-endian: high byte, then the low 16 bits.
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>BH', b, H))
    return b''.join(outlist)

Some examples:

In [34]: rawbytes('this is a test')
Out[34]: b'this is a test'

In [35]: rawbytes('\udc80\udcdf\udcff\udcff\udcff\x7f')
Out[35]: b'\xdc\x80\xdc\xdf\xdc\xff\xdc\xff\xdc\xff\x7f'
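
The lone surrogates in the second example (U+DC80 to U+DCFF) look like what Python's surrogateescape error handler produces for undecodable bytes. This is an assumption about where such a string came from, not something stated in the answer. If it holds, a sketch like the following recovers the original single bytes, which differs from the two-bytes-per-character output of rawbytes:

```python
raw = b'\x80\xdf\xff\xff\xff\x7f'

# Bytes that are not valid UTF-8 become lone surrogates U+DC80..U+DCFF.
text = raw.decode('utf-8', errors='surrogateescape')
print(ascii(text))       # '\udc80\udcdf\udcff\udcff\udcff\x7f'

# Encoding with the same handler restores the original bytes.
restored = text.encode('utf-8', errors='surrogateescape')
print(restored == raw)   # True
```

This only round-trips when the string was produced with the same codec and error handler in the first place.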

edited Nov 26 '18 at 22:41

answered Apr 21 '18 at 16:27

For this ("string") value: [\xc8\x07K\x03], I get: "struct.error: byte format requires -128 <= number <= 127" any ideas? – Noam Rathaus Nov 26 '18 at 14:54
@nrathaus You've found a bug: `struct.pack('b', num)` should be `struct.pack('B', num)`. It's fixed now. See updated answer. – Roland Smith Nov 26 '18 at 22:43

score 4 · Answer 3 · answered Sep 19 '19 at 13:53

I had this issue with a Python2 script that would talk to a Python3 script via xmlrpc. The problem was I wanted to open a file in 'wb' mode on the Python3 side. The incoming string was a bytes type when sent via Python3, but it was a str type when sent via Python2. I found using .encode only worked unreliably depending on the incoming data.

Here is the solution that worked for me:

incoming_data = bytes([ord(char) for char in incoming_data])
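
For strings whose code points are all below 256, this list comprehension should give the same result as encoding with latin-1. The sketch below checks that on a made-up sample value and shows that both approaches reject larger code points, just with different exception types.

```python
incoming_data = 'caf\xe9 \x00\xff'   # made-up sample, all code points < 256

via_ord = bytes([ord(char) for char in incoming_data])
via_codec = incoming_data.encode('latin-1')
print(via_ord == via_codec)          # True

# Both approaches reject code points above 255.
try:
    bytes([ord('\u0100')])
except ValueError as exc:
    print('bytes():', exc)

try:
    '\u0100'.encode('latin-1')
except UnicodeEncodeError as exc:
    print('encode():', exc)
```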

score 0 · Answer 4 · answered Feb 11 '19 at 14:32

You can simply encode('iso-8859-15')

>>> message = 'test 112 hello: what?!'
>>> message = message.encode('iso-8859-15')
>>> message 
b'test 112 hello: what?!'
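
Note that iso-8859-15 (Latin-9) is not the same codec as latin1 (iso-8859-1). The example above only contains ASCII characters, so it does not exercise the difference. A quick sketch that compares the two codecs over all 256 byte values:

```python
all_bytes = bytes(range(0x100))

as_latin1 = all_bytes.decode('latin-1')
as_latin9 = all_bytes.decode('iso-8859-15')

# Byte values whose character differs between the two codecs.
differing = [i for i in range(0x100) if as_latin1[i] != as_latin9[i]]
print([hex(i) for i in differing])

# A str holding chr(0xA4) cannot be encoded with iso-8859-15 at all.
try:
    '\xa4'.encode('iso-8859-15')
except UnicodeEncodeError as exc:
    print(exc)
```

Because a handful of positions map to different characters, iso-8859-15 cannot serve as an identity mapping between code points 0..255 and bytes.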

del1an

I tested this. It doesn't work, unfortunately. You'll see an answer below using range() that demonstrates how to test this. – ijw Jul 01 '20 at 00:06
"wihtout encode" – Enrique Benito Casado Mar 13 '22 at 07:42

score 0 · Answer 5 · answered Aug 13 '22 at 14:57

As an example: if you have the text b'\xdc\x80\xdc\xdf\xdc\xff\xdc\xff\xdc\xff\x7f' stored in a string object and you want to parse it into bytes, you can simply run eval() on that string, e.g. eval("b'\\xdc\\x80\\xdc\\xdf\\xdc\\xff\\xdc\\xff\\xdc\\xff\\x7f'").

Micha93

They don't have `"b'\xdc\x80\xdc...'"`, they have `'\xdc\x80\xdc...'` – snakecharmerb Aug 13 '22 at 15:09
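
For the case this answer does cover (a string that contains the text of a bytes literal, including the b prefix and quotes), ast.literal_eval from the standard library is a more cautious alternative to eval, since it only accepts Python literals. A minimal sketch with a shortened, made-up value:

```python
import ast

text = r"b'\xdc\x80\x7f'"          # the *text* of a bytes literal
value = ast.literal_eval(text)

print(value == b'\xdc\x80\x7f')    # True
print(type(value).__name__)        # bytes
```

As the comment above points out, this does not apply to the question's situation, where the string holds the characters themselves rather than a literal's source text.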

score -1 · Answer 6 · answered Mar 25 '19 at 11:27

Use base64:

>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'
>>> data = base64.b64decode(encoded)
>>> data
b'data to be encoded'

The encoded variable is still of type bytes, but it now contains only printable ASCII characters, so you can decode it to a str using 'utf-8' and encode it back again:

>>> str_data = encoded.decode('utf-8')
>>> str_data
'ZGF0YSB0byBiZSBlbmNvZGVk'
>>> encoded_str = str_data.encode('utf-8')
>>> encoded_str
b'ZGF0YSB0byBiZSBlbmNvZGVk'
