I am trying to open a file like this:

with open("myfile.txt", encoding="utf-8") as f:

but myfile.txt comes from my application's users, and 90% of the time the file is not valid UTF-8, which makes the application exit because it fails to read it properly. The error looks like: `'utf-8' codec can't decode byte 0x9c`.

I've googled it and found some Stack Overflow answers that say to open the file like this:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:

but other answers said to use:

with open("myfile.txt", encoding="utf-8", errors="replace") as f:

So what is the difference between `errors="replace"` and `errors="surrogateescape"`, and which one will fix the non-UTF-8 bytes in the file?

gabugu
  • Read [this](https://stackoverflow.com/questions/21116089/surrogateescape-cannot-escape-certain-characters) and [this](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#unicode-basics) – Sheldore Jun 04 '19 at 09:46
  • @Sheldore - I don't want to ignore the error, I want to replace the corrupted bytes. This isn't a duplicate. – gabugu Jun 04 '19 at 09:56
  • "And 90% of the times, this file comes as non UTF-8" - then you shouldn't be trying to read it as UTF-8. No value of the `errors` argument will fix that. You'll just get a pile of nonsense. – user2357112 Jun 04 '19 at 10:00
  • @user2357112 - I already do `encoding="utf-8"` but same problem occurs. – gabugu Jun 04 '19 at 10:01
  • Specifying `encoding="utf-8"` for a non-UTF-8 file makes no sense. You need to specify the encoding the file is actually encoded in. – user2357112 Jun 04 '19 at 10:04
  • Ok, vote retracted – Sheldore Jun 04 '19 at 10:06

2 Answers

The [docs](https://docs.python.org/3/library/codecs.html#error-handlers) say:

> `'replace'`: Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding. Implemented in `replace_errors()`.
>
> [...]
>
> `'surrogateescape'`: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the `'surrogateescape'` error handler is used when encoding the data. (See PEP 383 for more.)

That means that with `replace`, every offending byte is replaced with the same U+FFFD REPLACEMENT CHARACTER, while with `surrogateescape` each byte is replaced with a distinct value. For example, a '\xe9' would be replaced with '\udce9' and a '\xe8' with '\udce8'.

So with `replace`, you get valid Unicode characters but lose the original content of the file, while with `surrogateescape` you can still know the original bytes (and can even rebuild them exactly with `.encode(errors='surrogateescape')`), but your Unicode string is incorrect because it contains raw surrogate codes.

Long story short: if the original offending bytes do not matter and you just want to get rid of the error, `replace` is a good choice; if you need to keep them for later processing, `surrogateescape` is the way to go.
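
Here is a quick, runnable sketch of both behaviours (the byte 0xe9 is just an illustrative example):

data = b'caf\xe9'   # 'café' encoded as Latin-1, so invalid as UTF-8

print(data.decode('utf-8', errors='replace'))                # caf�  (U+FFFD)
print(repr(data.decode('utf-8', errors='surrogateescape')))  # 'caf\udce9'

# surrogateescape round-trips: the original bytes can be rebuilt exactly
s = data.decode('utf-8', errors='surrogateescape')
assert s.encode('utf-8', errors='surrogateescape') == data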


`surrogateescape` has a very nice property when you have files containing mainly ASCII characters and a few (accented) non-ASCII ones, and users who occasionally modify the file with a non-UTF-8 editor (or fail to declare the UTF-8 encoding). In that case you end up with a file containing mostly UTF-8 data plus some bytes in a different encoding, often CP1252 for Windows users in a non-English West European language (like French, Portuguese or Spanish). In that case it is possible to build a translation table that maps the surrogate characters to their equivalents in the CP1252 charset:

# first map all surrogates in the range 0xdc80-0xdcff to the codes 0x80-0xff
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
# then decode all bytes in the range 0x80-0xff as cp1252, and map the undecodable
#  ones to latin1 (using the previous transtable)
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
# finally use the above string to build a transtable mapping surrogates in the range
#  0xdc80-0xdcff to their cp1252 equivalent, or latin1 if the byte has no value in cp1252
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)

You can then decode a file containing a mojibake of UTF-8 and CP1252:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
    for line in f:                     # ok utf8 has been decoded here
        line = line.translate(tab)     # and cp1252 bytes are recovered here

I have successfully used this method several times to recover CSV files that were produced as UTF-8 and had then been edited with Excel on Windows machines.

The same method can be used for other charsets derived from ASCII.
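
To illustrate, here is a small self-contained check, reusing the `tab` built above (the sample text is purely hypothetical):

# hypothetical mixed stream: the same text encoded once as UTF-8, once as cp1252
raw = 'déjà vu, '.encode('utf-8') + 'déjà vu'.encode('cp1252')
s = raw.decode('utf-8', errors='surrogateescape').translate(tab)
print(s)   # déjà vu, déjà vu  -- both halves recovered

For another ASCII-derived charset, build `t` with that codec instead of `'cp1252'` (for example `'cp1250'` for Central European Windows files).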

Serge Ballesta
  • This is all assuming the input is mostly UTF-8 with a bit of corrupted garbage, though. If it's UTF-16 or Shift JIS or something, you're not going to successfully read the file either way until you specify the right encoding. – user2357112 Jun 04 '19 at 10:08
  • `but lose the original content of the file` what does this mean? – gabugu Jun 04 '19 at 10:10
  • @user2357112: above will ensure that all ascii characters will be preserved, and for UTF16, all ascii characters are represented as themselves and a null. So if the file contains mainly ascii it should be interpretable. Of course if it is binary or east language charset it will just be garbage. – Serge Ballesta Jun 04 '19 at 10:14
  • @gabugu: reading a file will not destroy it. but when you find a replacement character, you can no longer guess what its original byte was in the file. – Serge Ballesta Jun 04 '19 at 10:15
  • But `replace` isn't like `errors="ignore"` right? so it won't fully ignore the non UTF-8 line like `ignore`? – gabugu Jun 04 '19 at 10:19
  • `ignore` does not ignore *lines*. Simply offending (non-UTF-8) characters are silently discarded. With `replace` you cannot know what they were, but at least you know where they were. – Serge Ballesta Jun 04 '19 at 10:22

My problem was that the file had lines with mixed encoding types.

The fix was to remove `encoding="utf-8"` and add `errors="replace"`. So the `open()` line ends up like this:

with open("myfile.txt", errors="replace") as f:

If it were possible to reliably detect the encoding of a file, I'd have passed it as the `encoding` parameter, but unfortunately encoding detection is guesswork at best.
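
That said, a heuristic guess is possible with a third-party library such as `chardet` (a rough sketch, assuming `pip install chardet`; the result is only a confidence-weighted guess, not a certainty):

import chardet

with open("myfile.txt", "rb") as f:
    raw = f.read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")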

gabugu
  • If you really have mixed encodings, the [`ftfy` library](https://pypi.org/project/ftfy/) might help you. – lenz Jun 04 '19 at 13:40