
In Python 3.5+, .decode("utf-8", "backslashreplace") is a pretty good option for dealing with binary strings that are partly UTF-8 and partly some unknown legacy encoding. Valid UTF-8 sequences are decoded, and invalid bytes are preserved as escape sequences. For instance:

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
¡\xa1

This loses the distinction between b'\xc2\xa1\xa1' and b'\xc2\xa1\\xa1', but if you're in the "just get me something not too lossy that I can fix up by hand later" frame of mind, that's probably OK.
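
To make that ambiguity concrete, here is a small sketch (Python 3.5+ only; the variable names are made up for illustration) showing that both byte strings decode to the same text:

```python
# Requires Python 3.5+, where "backslashreplace" works for decoding.
raw_invalid = b'\xc2\xa1\xa1'    # valid U+00A1, then a stray 0xA1 byte
raw_literal = b'\xc2\xa1\\xa1'   # valid U+00A1, then the ASCII text "\xa1"

decoded_invalid = raw_invalid.decode("utf-8", "backslashreplace")
decoded_literal = raw_literal.decode("utf-8", "backslashreplace")

# Both inputs produce the same five-character string, so the original
# bytes cannot be recovered from the decoded text alone.
same = decoded_invalid == decoded_literal
print(same)
```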

However, this is a new feature in Python 3.5. The program I'm working on also needs to support 3.4 and 2.7. In those versions, the same call raises an exception (shown here on 2.7):

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback

I have found an approximation, but not an exact equivalent:

>>> print(b'\xc2\xa1\xa1'.decode("latin1")
...       .encode("ascii", "backslashreplace").decode("ascii"))
\xc2\xa1\xa1

It is very important that the behavior not depend on the interpreter version. Can anyone advise a way to get exactly the Python 3.5 behavior in 2.7 and 3.4?

(Older versions of either 2.x or 3.x do not need to work. Monkey patching codecs is totally acceptable.)

anthony sottile
zwol
  • "Changed in version 3.5: The 'backslashreplace' error handlers now works with decoding and translating." -- Did you mean 3.4 or 3.5? – Josh Lee Mar 17 '17 at 14:42
  • @JoshLee I was sloppy and only tested it in 3.5. I do in fact need something that works with 3.4. – zwol Mar 17 '17 at 15:51

2 Answers


I attempted a more complete backport of the CPython implementation.

It handles UnicodeDecodeError (from .decode()), UnicodeEncodeError (from .encode()) and UnicodeTranslateError (from .translate()):

from __future__ import unicode_literals

import codecs


def _bytes_repr(c):
    """py2: bytes, py3: int"""
    if not isinstance(c, int):
        c = ord(c)
    # Always two hex digits, as CPython does (0x0a -> \x0a, not \xa).
    return '\\x{:02x}'.format(c)


def _text_repr(c):
    d = ord(c)
    if d >= 0x10000:
        return '\\U{:08x}'.format(d)
    elif d >= 0x100:
        return '\\u{:04x}'.format(d)
    else:
        # CPython emits \xNN for code points below 0x100.
        return '\\x{:02x}'.format(d)


def backslashescape_backport(ex):
    s, start, end = ex.object, ex.start, ex.end
    c_repr = _bytes_repr if isinstance(ex, UnicodeDecodeError) else _text_repr
    return ''.join(c_repr(c) for c in s[start:end]), end


codecs.register_error('backslashescape_backport', backslashescape_backport)

print(b'\xc2\xa1\xa1after'.decode('utf-8', 'backslashescape_backport'))
print(u'\u2603'.encode('latin1', 'backslashescape_backport'))
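
Since the question asks for exactly the 3.5 behavior, it may be worth checking a backport against the built-in handler wherever that is available. The following is only a sketch: `_escape_bytes`, the `'check_backport'` handler name and the `SAMPLES` list are hypothetical names chosen for this example, and the handler is a stand-in that covers just the decode branch.

```python
from __future__ import unicode_literals

import codecs
import sys


def _escape_bytes(ex):
    # Hypothetical stand-in for the decode branch of a backported handler.
    # bytearray() yields ints on both Python 2 and Python 3.
    chunk = bytearray(ex.object[ex.start:ex.end])
    return ''.join('\\x{:02x}'.format(b) for b in chunk), ex.end


codecs.register_error('check_backport', _escape_bytes)

# A few inputs with stray bytes, a truncated sequence, and plain ASCII.
SAMPLES = [b'\xc2\xa1\xa1', b'\xff\xfeabc', b'\xe2\x82', b'ok', b'\x80\n\xc3']

results = [s.decode('utf-8', 'check_backport') for s in SAMPLES]

if sys.version_info >= (3, 5):
    # Only 3.5+ can decode with the built-in handler, so compare there.
    expected = [s.decode('utf-8', 'backslashreplace') for s in SAMPLES]
    assert results == expected
```

On 2.7 and 3.4 the comparison is skipped, so the same sample list would need to be run once on 3.5+ to produce reference output.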
anthony sottile

You can write your own error handler. Here's a solution that I tested on Python 2.7, 3.3 and 3.6:

from __future__ import print_function
import codecs
import sys

print(sys.version)

def myreplace(ex):
    # The error handler receives the UnicodeDecodeError, whose attributes hold the
    # input byte string and the start/end indexes of the bad portion.
    bstr, start, end = ex.object, ex.start, ex.end

    # The return value is a tuple of a Unicode string and the index at which to
    # continue conversion.
    # Note: iterating byte strings returns int on 3.x but str on 2.x
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c))
                    for c in bstr[start:end]), end

codecs.register_error('myreplace', myreplace)
print(b'\xc2\xa1\xa1ABC'.decode("utf-8", "myreplace"))

Output:

C:\>py -2.7 test.py
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
¡\xa1ABC

C:\>py -3.3 test.py
3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)]
¡\xa1ABC

C:\>py -3.6 test.py
3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)]
¡\xa1ABC
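
Note that a handler like this only makes sense for decoding. If it were passed to .encode(), it would format whole code points with \x. One possible guard, sketched here under the hypothetical name `decode_only_replace` (not part of the tested script above), is to re-raise anything that is not a UnicodeDecodeError, so that it behaves like 'strict' in those cases:

```python
from __future__ import unicode_literals

import codecs


def decode_only_replace(ex):
    # Hypothetical variant: escape bad bytes when decoding, and behave
    # like 'strict' for encode/translate errors by re-raising them.
    if not isinstance(ex, UnicodeDecodeError):
        raise ex
    bad = bytearray(ex.object[ex.start:ex.end])  # ints on 2.x and 3.x
    return ''.join('\\x{:02x}'.format(b) for b in bad), ex.end


codecs.register_error('decode_only_replace', decode_only_replace)

decoded = b'\xc2\xa1\xa1ABC'.decode('utf-8', 'decode_only_replace')

try:
    '\u2603'.encode('ascii', 'decode_only_replace')
    encode_failed = False
except UnicodeEncodeError:
    # The guard re-raised the original error instead of escaping it.
    encode_failed = True
```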
Mark Tolonen