In Python 3.5+, .decode("utf-8", "backslashreplace") is a pretty good option for dealing with binary strings that are partially Unicode and partially some unknown legacy encoding. Valid UTF-8 sequences are decoded; invalid bytes are preserved as backslash escape sequences. For instance:
>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
¡\xa1
This loses the distinction between b'\xc2\xa1\xa1' and b'\xc2\xa1\\xa1', but if you're in the "just get me something not too lossy that I can fix up by hand later" frame of mind, that's probably OK.
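You can check the collision directly: the string with the raw byte and the string with a pre-existing literal escape decode to the same text.

>>> raw = b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace")
>>> pre_escaped = b'\xc2\xa1\\xa1'.decode("utf-8", "backslashreplace")
>>> raw == pre_escaped
True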
However, this is a new feature in Python 3.5. The program I'm working on also needs to support 3.4 and 2.7, and in those versions the same call throws an exception:
>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
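As far as I can tell, the handler itself exists in 2.7 and 3.4; it's just only hooked up for encoding, not decoding. For instance, under 2.7:

>>> u'\xa1'.encode("ascii", "backslashreplace")
'\\xa1'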
I have found an approximation, but not an exact equivalent; it escapes every non-ASCII byte, including the ones that form valid UTF-8 sequences:
>>> print(b'\xc2\xa1\xa1'.decode("latin1")
... .encode("ascii", "backslashreplace").decode("ascii"))
\xc2\xa1\xa1
It is very important that the behavior not depend on the interpreter version. Can anyone advise a way to get exactly the Python 3.5 behavior in 2.7 and 3.4?
(Older versions of either 2.x or 3.x do not need to work. Monkey-patching codecs is totally acceptable.)
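To be concrete, the following is the general shape of patch I have in mind. It is only an untested sketch: the handler body is my guess at reproducing the 3.5 escaping, and re-registering the stock "backslashreplace" name is exactly the kind of monkey patch I mean.

import codecs
import sys

if sys.version_info < (3, 5):
    # Keep the stock handler around for encode errors; only decoding changes.
    _stock = codecs.lookup_error("backslashreplace")

    def _backslashreplace_decode(exc):
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            # bytes iterates as 1-char strs on 2.7 and as ints on 3.x
            text = u"".join(u"\\x%02x" % (c if isinstance(c, int) else ord(c))
                            for c in bad)
            return text, exc.end
        return _stock(exc)

    codecs.register_error("backslashreplace", _backslashreplace_decode)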