Unicode error handling with Python 3's readlines()

Question

I keep getting this error while reading a text file. Is it possible to handle/ignore it and proceed?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>

This varies a _lot_ based on details. Python 2? Python 3? Are you trying to decode strings you already read? How? Etc. — Charles Duffy, May 07 '12 at 19:05
Okay -- updated the question to specify Python 3. Unicode is one of the places where there are very big differences between 2 and 3; please be sure to specify version explicitly in the future. — Charles Duffy, May 07 '12 at 20:22
For a more general case, it is probably worth looking at this: https://stackoverflow.com/questions/58199571/unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-invalid — Dragonthoughts, Jun 16 '20 at 11:04

score 79 · Charles Duffy · Accepted Answer · 2012-05-08T14:50:15.280

In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object (presuming it to be an io.TextIOWrapper -- and if it isn't, consider wrapping it in one!). Also consider passing an explicit encoding: the 'charmap' codec in the error message usually means a legacy single-byte code page (often a platform default such as cp1252) was used, and when you aren't sure of the real encoding, utf-8 is a good place to start.

For instance:

f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
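To make the difference between the two error handlers concrete, here is a minimal sketch. The sample bytes and the temporary file are made up for illustration; only open() and its encoding/errors parameters come from the answer above.

```python
import os
import tempfile

# Hypothetical sample data: mostly ASCII text with one stray 0x81 byte,
# which is not valid UTF-8 on its own.
raw = b"first line\nbad \x81 byte\nlast line\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='replace' substitutes U+FFFD for each undecodable byte.
    with open(path, encoding="utf-8", errors="replace") as f:
        replaced = f.readlines()

    # errors='ignore' silently drops undecodable bytes.
    with open(path, encoding="utf-8", errors="ignore") as f:
        ignored = f.readlines()
finally:
    os.remove(path)

print(replaced)
print(ignored)
```

Either way readlines() runs to the end of the file. With 'replace' you can still see where data was damaged; with 'ignore' the bad bytes vanish without a trace.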

In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don't have a better guess for their real encoding:

your_string.decode('utf-8', 'replace')

...to replace unhandled characters, or

your_string.decode('utf-8', 'ignore')

to simply ignore them.

That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.
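If the real encoding is unknown, one rough approach is to try a short list of candidates with strict error handling and keep the first that decodes cleanly. This is only a sketch: decode_with_fallbacks and the candidate list are hypothetical, and a clean decode does not prove the guess is right (latin-1 in particular accepts any byte sequence).

```python
def decode_with_fallbacks(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes.

    The candidate list is an assumption; order it from most to least
    strict, since permissive codecs such as latin-1 never fail.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings matched")


# 0xE9 followed by a space is invalid UTF-8, and 0x81 is undefined in
# Python's cp1252 codec, so this sample falls through to latin-1.
text, used = decode_with_fallbacks(b"caf\xe9 \x81")
print(used)
```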

edited May 08 '12 at 14:50

answered May 07 '12 at 19:06

Charles Duffy


Minor nitpick: in Python 2, the trick is *decoding* them, not encoding. But you know that, because you're calling the `decode` method. – Thomas K May 07 '12 at 20:32

@ThomasK Oops. Shortened the verbiage -- fewer things to get wrong. Thanks for the proofread. :) – Charles Duffy May 08 '12 at 14:51
by passing encoding and errors parameters, it seems to be working. – Bob May 08 '12 at 18:43
Question: is there a way to check which encoding the file has been generated with? – Bob May 08 '12 at 18:48
@Bob sure -- just check `fileobj.encoding` (and `fileobj.errors` for the error-handling mode); should work as long as `fileobj` is a `TextIOWrapper`. – Charles Duffy May 08 '12 at 18:56
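As a small illustration of the attributes mentioned in the comment above: note that they report the encoding and error mode the file object was opened with, which is not necessarily the encoding the file was originally written in. The temporary file here is only scaffolding for the example.

```python
import os
import tempfile

# Scratch file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    path = tmp.name

try:
    with open(path, encoding="utf-8", errors="replace") as fileobj:
        # These reflect the arguments passed to open(), not a property
        # detected from the file's contents.
        chosen_encoding = fileobj.encoding
        chosen_errors = fileobj.errors
finally:
    os.remove(path)

print(chosen_encoding, chosen_errors)
```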

score 2 · Answer 2 · answered May 07 '12 at 19:26

You should open the file with the codecs module to make sure that it gets interpreted as UTF-8.

import codecs
fd = codecs.open(filename,'r',encoding='utf-8')
data = fd.read()

answered May 07 '12 at 19:26

optixx


score -5 · Answer 3 · edited Dec 09 '14 at 20:30

Yeah, you could wrap it in a

try:
    ...  # read the file here
except UnicodeDecodeError:
    pass

edited Dec 09 '14 at 20:30

Charles Duffy


answered May 07 '12 at 19:04

cobie


yes, but that doesn't help much in terms of explaining how to proceed with reading the rest of the file. – Charles Duffy May 07 '12 at 20:21
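One way a try/except could still let you proceed with the rest of the file is to read in binary mode and decode each line separately, so a failure only costs that one line. This is a sketch, not something proposed in the thread: read_lines_skipping_bad is a hypothetical helper, and splitting on b"\n" assumes an ASCII-compatible encoding such as utf-8.

```python
import os
import tempfile


def read_lines_skipping_bad(path, encoding="utf-8"):
    """Yield decoded lines, skipping any line that fails to decode."""
    with open(path, "rb") as f:  # binary mode: no decoding happens yet
        for raw_line in f:
            try:
                yield raw_line.decode(encoding)
            except UnicodeDecodeError:
                continue  # drop this line and keep reading


# Hypothetical sample file with one undecodable line in the middle.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first line\nbad \x81 byte\nlast line\n")
    path = tmp.name

try:
    good = list(read_lines_skipping_bad(path))
finally:
    os.remove(path)

print(good)
```

Because the file is read as bytes, line endings are not translated; a file with Windows line endings would yield lines ending in "\r\n".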
