Python: UnicodeDecodeError: 'utf8' codec can't decode byte 0x91

Question

I'm parsing a CSV as follows:

with open(args.csv, 'rU') as csvfile:
    try:
        reader = csv.DictReader(csvfile, dialect=csv.QUOTE_NONE)
        for row in reader:
            ...

where args.csv is the name of my file. One of the rows in my file contains an e with two dots on top (ë), and my script breaks when it reaches that row.

I get the following stack trace:

File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)

and the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 5: invalid start byte

FWIW, I'm running Python 2.7 and upgrading isn't an option (for a few reasons).

I'm pretty lost about how to fix this so any help is much appreciated.

Thanks!

What if you try `with open(args.csv, 'rU', encoding='utf-8') as csvfile:`? — DeepSpace, Jun 24 '16 at 17:53
Could you add some data from the CSV file, perhaps as a hexdump? It may be that the file cannot be meaningfully interpreted as UTF-8 because it was written with a Windows code page or some other encoding. — Dilettant, Jun 24 '16 at 17:57
The dots are called an [umlaut](https://en.wikipedia.org/wiki/Diaeresis_(diacritic)) — Wayne Werner, Jun 24 '16 at 20:09
The error doesn't come from the code shown; it comes from the call to `json.dumps` — Antti Haapala -- Слава Україні, Jun 24 '16 at 20:27
To handle your cp1252-encoded data please see the [Examples](https://docs.python.org/2/library/csv.html#examples) at the end of the CSV docs. Also, in Python 2 you should open csv files in binary mode, as mentioned near the start of those docs. — PM 2Ring, Jun 25 '16 at 02:27
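As the comments above point out, the traceback ends in `json.dumps`, not in the CSV loop. A minimal sketch of one way around that, assuming the file really is Windows-1252: decode the byte strings to text before serializing. The `row` dict and its contents are hypothetical stand-ins for what the `csv` module returns in Python 2.

```python
import json

# Hypothetical row, shaped like what Python 2's csv module yields: raw bytes.
row = {"name": b"\x91Zo\xeb\x92"}

# Decode every value to text using the assumed source encoding.
decoded = {key: value.decode("windows-1252") for key, value in row.items()}

# With text values, json.dumps no longer has to guess at an encoding.
encoded = json.dumps(decoded)
print(encoded)
```

The same lines run on Python 2.7 and Python 3, since `b"..."` literals and dict comprehensions exist in both.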

score 10 · Answer 1 · answered Jun 24 '16 at 17:57

Byte 0x91 is a "smart" opening single quote in the Windows-1252 encoding, so it sounds like your file is encoded as Windows-1252 rather than UTF-8. Use open(args.csv, 'rU', encoding='windows-1252'), which relies on the encoding argument that the built-in open() accepts in Python 3.
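To check the diagnosis, here is a small sketch that decodes the same bytes both ways. The sample bytes are made up for illustration; only 0x91 comes from the error message in the question.

```python
# Made-up sample: 0x91 and 0x92 are curly quotes in Windows-1252, 0xEB is e-diaeresis.
raw = b"\x91Zo\xeb\x92"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x91 cannot start a UTF-8 sequence, hence "invalid start byte".
    utf8_ok = False

# The same bytes decode cleanly as Windows-1252.
text = raw.decode("windows-1252")
```

Here `text` holds U+2018, "Zo", U+00EB, U+2019, which matches the "e with two dots" described in the question.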

answered Jun 24 '16 at 17:57

C. K. Young

When I follow your answer, I get: "TypeError: 'encoding' is an invalid keyword argument for this function". Fwiw, I'm running Python 2.7 and (for a few reasons) can't change that. – anon_swe Jun 24 '16 at 18:04
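For Python 2.7, where the built-in `open()` has no `encoding` argument, one possible approach is the one the `csv` docs describe: open the file in binary mode and decode each field after parsing. The sketch below is not from the answer itself. `read_rows` is a hypothetical helper, the encoding is assumed to be Windows-1252, and the Python 2 branch assumes every row has all of its columns (a short row would produce `None` values that cannot be decoded).

```python
import csv
import io
import os
import sys
import tempfile

ENCODING = "windows-1252"  # assumption: the file's actual encoding


def read_rows(path, encoding=ENCODING):
    """Yield each CSV row as a dict with text (unicode) keys and values."""
    if sys.version_info[0] == 2:
        # Python 2: csv parses bytes, so read in binary mode and decode afterwards.
        with open(path, "rb") as handle:
            for row in csv.DictReader(handle):
                yield {key.decode(encoding): value.decode(encoding)
                       for key, value in row.items()}
    else:
        # Python 3: the file object decodes, and csv parses text.
        with io.open(path, encoding=encoding, newline="") as handle:
            for row in csv.DictReader(handle):
                yield dict(row)


# Self-check against a throwaway file containing bytes 0x91, 0xEB and 0x92.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(b"name,city\r\n\x91Zo\xeb\x92,Gen\xe8ve\r\n")

rows = list(read_rows(sample_path))
os.remove(sample_path)
```

Because the values come back as text rather than raw bytes, they can be passed to `json.dumps` without triggering the `UnicodeDecodeError` from the question.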
@bclayman It is preferable that you mention that in your question, even though it is mentioned in the stacktrace. – DeepSpace Jun 24 '16 at 18:07
Great answer! I managed to convert a file in the Uzbek language to UTF-8 with `iconv -t UTF-8 -f Windows-1252 in.xml`. Otherwise I would have spent a lot of time guessing what the 0x91 and 0x92 characters mean. – Boris Treukhov Feb 25 '18 at 18:46
