The doc says:
'replace':
Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding. Implemented in replace_errors().
...
'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
That means that with 'replace', any offending byte will be replaced with the same U+FFFD REPLACEMENT CHARACTER, while with 'surrogateescape' each byte will be replaced with a different value. For example, a '\xe9' byte would be replaced with '\udce9' and a '\xe8' byte with '\udce8'.
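A quick interactive check shows the difference (a minimal sketch; the byte string is just an illustrative example):

>>> b'caf\xe9'.decode('utf-8', errors='replace')
'caf�'
>>> b'caf\xe9'.decode('utf-8', errors='surrogateescape')
'caf\udce9'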
So with 'replace' you get valid unicode characters but lose the original content of the file, while with 'surrogateescape' you can recover the original bytes (and can even rebuild them exactly with .encode(errors='surrogateescape')), but your unicode string is incorrect because it contains raw surrogate codes.
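The round trip can be checked the same way (same illustrative bytes as above):

>>> s = b'caf\xe9'.decode('utf-8', errors='surrogateescape')
>>> s.encode('utf-8', errors='surrogateescape')   # the original bytes come back
b'caf\xe9'
>>> s.encode('utf-8')   # raises UnicodeEncodeError: surrogates not allowed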
Long story short: if the original offending bytes do not matter and you just want to get rid of the error, 'replace' is a good choice; if you need to keep them for later processing, 'surrogateescape' is the way to go.
'surrogateescape' has a very nice property when you have files containing mainly ASCII characters plus a few (accented) non-ASCII ones, and users who occasionally modify the file with a non-UTF-8 editor (or fail to declare the UTF-8 encoding). In that case you end up with a file containing mostly UTF-8 data and some bytes in a different encoding, often CP1252 for Windows users in a non-English Western European language (like French, Portuguese or Spanish). In that case it is possible to build a translation table that maps the surrogate characters to their equivalents in the CP1252 charset:
# first map all surrogates in the range 0xdc80-0xdcff to codes 0x80-0xff
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
# then decode all bytes in the range 0x80-0xff as cp1252, and map the undecoded ones
# to latin1 (using previous transtable)
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
# finally use above string to build a transtable mapping surrogates in the range 0xdc80-0xdcff
# to their cp1252 equivalent, or latin1 if byte has no value in cp1252 charset
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)
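A quick sanity check of the resulting table (the expected values come from the cp1252 and latin1 code pages):

>>> '\udce9'.translate(tab)   # 0xe9 is 'é' in cp1252
'é'
>>> '\udc80'.translate(tab)   # 0x80 is '€' in cp1252
'€'
>>> '\udc81'.translate(tab)   # 0x81 is undefined in cp1252, so latin1 fallback
'\x81'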
You can then decode a file containing a mojibake of UTF-8 and CP1252:
with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
    for line in f:                  # the valid utf8 has been decoded here
        line = line.translate(tab)  # and the cp1252 bytes are recovered here
I have successfully used that method several times to recover CSV files that were produced as UTF-8 and had been edited with Excel on Windows machines.
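If the goal is to rewrite such a file as clean UTF-8, the whole repair fits in a few lines (a minimal sketch; the file names are just placeholders):

# read the mixed utf8/cp1252 file, fix the stray cp1252 bytes, write clean utf8 back
with open("broken.csv", encoding="utf-8", errors="surrogateescape") as src, \
     open("fixed.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.translate(tab))   # after translate() the line is valid unicode again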
The same method can be used for other charsets derived from ASCII.