Pandas can't correctly interpret accents with UTF8 option

Question

I can't correctly read excel data with accented characters with pandas.

data = pd.read_excel("C:/Users/XXX/Desktop/Help_me_plz.xlsx", encoding='utf-8')

This is what I obtain:

    ID  Titre   EntitÃ©
0   2020044459  SOAPPRO - ProblÃ¨me ouverture documents Root entity > Utilisateurs
1   2020048819  Probleme de conformitÃ© Smartphone KMSE Root entity > Utilisateurs

As you can see, the accents are not interpreted correctly and appear as weird characters.

I searched on the Internet and tried several things:

Converting the file to CSV
Converting the file to various encodings
Opening the file with Notepad, but the problem is still there
I even tried the following code, which returns the wrong output:
```
from unidecode import unidecode
print(unidecode('EntitÃ©'))
```

I was expecting Entité but it gave me the following output: EntitA(c).

Is there a way to interpret the accents correctly, or to identify the right encoding to use?

score 0 · Accepted Answer · answered Jul 02 '20 at 15:56

You can't unidecode('EntitÃ©') because it's already decoded as 'EntitÃ©'.

You need to fix the data at the source, which seems to be your spreadsheet.

Have a look at "Are XLSX files UTF-8 encoded by definition?"

And also: https://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html

The encoding='utf-8' parameter is passed to an underlying pandas.io.parsers.TextFileReader object, which blindly accepts that you know your file is encoded in UTF-8. That doesn't seem to be the case here.

Try utf-16 or latin-1 and see if the results change. The way you need to deal with this is to figure out what encoding the file actually uses.
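
To see what "trying another encoding" actually changes, here is a minimal sketch that decodes the same bytes under several candidate encodings. The helper name `preview_decodings` and the sample bytes are made up for illustration; they are not part of pandas or of your file. `ascii()` is used for printing so the output does not depend on the terminal's own encoding.

```python
def preview_decodings(raw, candidates=("utf-8", "utf-16", "latin-1")):
    """Decode the same bytes with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Illustrative sample: the UTF-8 bytes of an accented word.
sample = "Entité".encode("utf-8")
results = preview_decodings(sample)

for name, text in results.items():
    # ascii() escapes non-ASCII characters, so the terminal cannot garble them.
    print(name, ascii(text))
```

Note that latin-1 never fails, because every byte value maps to some character, so "no error" does not prove the guess was right.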

The XLSX format is a zipped XML document. Change the extension to zip, open it up and check the encoding in the XML data.

You could write accompanying code to determine encoding for you in future.
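
Such a check could look roughly like the sketch below, which opens the workbook as a zip archive and reads the encoding named in each XML part's declaration. The function name `xml_encodings_in_xlsx` is hypothetical, and the demo builds a tiny stand-in archive in memory rather than a real workbook; with a real file you would pass its path instead.

```python
import io
import re
import zipfile


def xml_encodings_in_xlsx(source):
    """Map each XML part in the archive to the encoding in its XML declaration.

    `source` may be a path or a file-like object. Parts without an
    explicit encoding in the declaration map to None.
    """
    pattern = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
    found = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if not name.endswith((".xml", ".rels")):
                continue
            with archive.open(name) as part:
                head = part.read(200)  # the declaration sits at the very start
            # Skip a UTF-8 byte order mark if one precedes the declaration.
            match = pattern.match(head.lstrip(b"\xef\xbb\xbf"))
            found[name] = match.group(1).decode("ascii") if match else None
    return found


# Stand-in archive for illustration only; not a complete workbook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "xl/workbook.xml",
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><workbook/>',
    )
buffer.seek(0)

encodings = xml_encodings_in_xlsx(buffer)
print(encodings)
```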

Thanks for this useful information. I tried both utf-16 and latin-1; nothing changed. I changed the extension to zip and found the following information: `` It seems that utf-8 was the right encoding. Should I conclude that my problem can't be resolved by applying the right encoding when reading my file? — Icy, Jul 03 '20 at 08:10

score 0 · Answer 2 · answered Jul 02 '20 at 16:04

Hmm, what you show is a hint that you have correctly processed the Excel file, but the problem occurs at display time. Long story short, this is what you see when you look at a UTF-8 encoded file in a Latin1 (or Windows cp1252) terminal or editor.

Demo:

>>> print('Problème'.encode().decode('latin1'))
ProblÃ¨me
>>> print('Entité'.encode().decode('latin1'))
EntitÃ©

So you should show the code that produces that display; the problem is there.
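
For completeness, the round trip in the demo can be reversed. This is only a sketch, and it only helps if the strings themselves are garbled (for example after being saved through a wrongly configured editor), not if the terminal alone is at fault. The helper name `undo_latin1_mojibake` is made up, and it assumes Latin1 was the wrong encoding involved; with Windows cp1252 a few characters differ, so that would need "cp1252" instead.

```python
import sys


def undo_latin1_mojibake(text):
    """Reverse UTF-8 bytes that were wrongly decoded as Latin1.

    If the text does not look like that kind of damage, it is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the garbled form from the demo, then repair it.
garbled = "Entité".encode("utf-8").decode("latin-1")
repaired = undo_latin1_mojibake(garbled)

# ascii() keeps the output independent of the terminal's encoding.
print(ascii(garbled), "->", ascii(repaired))

# The encoding Python uses when printing; anything other than UTF-8
# here is a likely cause of the display problem.
print(sys.stdout.encoding)
```

Text that is already correct passes through untouched, because its Latin1 bytes are not valid UTF-8 and the helper falls back to the input.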
