Given a byte string, for instance B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_", I want to convert it to a printable string that keeps as much valid UTF-8 as possible and escapes the rest: S = "\\x81\\xc9\\x00\\x07I ABCD_↗_". Note that the first group of hex bytes are not valid UTF-8 characters, but the last 3 do define a valid UTF-8 character (the arrow). It seems like this should be part of codecs, but I cannot figure out how to make this happen.

For instance,

>>> codecs.decode(codecs.escape_encode(B, 'utf-8')[0], 'utf-8')
'\\x81\\xc9\\x00\\x07I\\x19ABCD_\\xe2\\x86\\x97_'

escapes the valid UTF-8 character along with the invalid ones.

1 Answer

Specifying 'backslashreplace' as the error handling mode when decoding a bytestring will replace un-decodable bytes with backslashed escape sequences:

decoded = B.decode('utf-8', errors='backslashreplace')
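As a minimal sketch with the byte string from the question (this assumes Python 3.5 or later, where 'backslashreplace' is accepted when decoding): only bytes that cannot be decoded are escaped. \x00 and \x07 are valid UTF-8, so they come through as real control characters rather than as escape sequences. The to_printable helper below is a hypothetical addition, not part of codecs, that also escapes those leftover non-printable characters.

```python
B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_"

# Undecodable bytes (\x81, \xc9) become backslash escapes; valid sequences,
# including the three-byte arrow, decode normally.
decoded = B.decode("utf-8", errors="backslashreplace")


def to_printable(data: bytes) -> str:
    """Hypothetical helper: decode as above, then escape non-printable characters."""
    text = data.decode("utf-8", errors="backslashreplace")
    return "".join(
        # str.isprintable() is False for control characters such as NUL and BEL.
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


print(repr(decoded))
print(to_printable(B))
```

One caveat with this sketch: a literal backslash in the input is printable and is left alone, so the result cannot always be mapped back to the original bytes unambiguously.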

Also, this is a decoding operation, not an encoding operation. Decoding is bytes->string. Encoding is string->bytes.
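A short illustration of the two directions, using the arrow character from the question (the variable names are just for this example):

```python
text = "_\u2197_"                  # a str containing the arrow character

data = text.encode("utf-8")        # encoding: str -> bytes
restored = data.decode("utf-8")    # decoding: bytes -> str

print(data)      # the arrow is stored as the three bytes e2 86 97
print(restored)
```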

user2357112