
I can't correctly read Excel data containing accented characters with pandas.

data = pd.read_excel("C:/Users/XXX/Desktop/Help_me_plz.xlsx", encoding='utf-8')

This is what I obtain:

    ID          Titre                                    Entité
0   2020044459  SOAPPRO - Problème ouverture documents  Root entity > Utilisateurs
1   2020048819  Probleme de conformité Smartphone KMSE  Root entity > Utilisateurs

As you can see, the accents are not interpreted correctly and appear as weird characters.

I searched on the Internet and tried several things:

  • Converting the file to CSV

  • Converting the file to various encodings

  • Opening the file with Notepad, but the problem is still there

  • I even tried the following code, which returns the wrong output:

    from unidecode import unidecode
    print(unidecode('Entité'))
    

I was expecting Entité but it gave me the following output: EntitA(c).

Is there a way to interpret the accents correctly, or to identify the right encoding to use?

martineau
Icy

2 Answers


You can't unidecode('Entité') because it's already decoded as 'Entité'.

You need to fix the data at the source which seems to be your spreadsheet.

Have a look at "Are XLSX files UTF-8 encoded by definition?"

And also: https://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html

The encoding='utf-8' parameter is passed to an underlying pandas.io.parsers.TextFileReader object, which blindly trusts that your file really is encoded in UTF-8. That doesn't seem to be the case here.

Try utf-16 or latin-1 and see if the results change. The way you need to deal with this is to figure out what encoding the file actually uses.

The XLSX format is a zipped XML document. Change the extension to zip, open it up and check the encoding in the XML data.

You could write accompanying code to determine the encoding for you in the future.
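
As a rough sketch of that idea (not a definitive tool), the snippet below opens the workbook as a zip archive and reports the encoding named in each XML part's declaration. The function name `declared_encodings` and the 200-byte read are my own choices for illustration. It assumes the parts are stored in an ASCII-compatible encoding, so a UTF-16 part would show up as None. Python's zipfile can read the .xlsx directly, so renaming the file is not needed.

```python
import re
import zipfile

# Matches the encoding attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
_XML_DECL = re.compile(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encodings(xlsx_path):
    """Map each XML part of the workbook to its declared encoding.

    Hypothetical helper: parts without an explicit declaration map to None.
    """
    found = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            # Only look at XML parts; skip images and other binary members.
            if not name.endswith((".xml", ".rels")):
                continue
            with archive.open(name) as part:
                head = part.read(200)  # the declaration sits at the very start
            match = _XML_DECL.search(head)
            found[name] = match.group(1).decode("ascii") if match else None
    return found
```

Calling `declared_encodings("C:/Users/XXX/Desktop/Help_me_plz.xlsx")` would then list what each part claims to use.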

razodactyl
  • Thanks for this useful information. I tried both utf-16 and latin-1; nothing changed. I changed the extension to zip and found the following information: `` It seems that utf-8 was the right encoding. Should I conclude that my problem can't be resolved by applying the right encoding when reading my file? – Icy Jul 03 '20 at 08:10

Hmm, what you show is a hint that you have correctly processed the Excel file, but the problem occurs at display time. Long story short, this is what you see when you look at a UTF-8 encoded file in a Latin1 (or Windows cp1252) terminal or editor.

Demo:

>>> print('Problème'.encode().decode('latin1'))
Problème
>>> print('Entité'.encode().decode('latin1'))
Entité

So you should show the code that produces that display; the problem is there.
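
To tell whether the string itself or only its display is wrong, a small diagnostic along these lines may help. It is a sketch under assumptions: `undo_utf8_as_latin1` is a hypothetical helper name, and it only reverses the exact UTF-8-read-as-Latin1 mix-up shown in the demo (a cp1252 mix-up may need "cp1252" instead of "latin1").

```python
import sys


def undo_utf8_as_latin1(text):
    """Reverse the demo's round trip: UTF-8 bytes wrongly decoded as Latin1.

    Hypothetical helper: returns the text unchanged when it does not look
    like that particular kind of mojibake.
    """
    try:
        return text.encode("latin1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin1, or the resulting
        # bytes are not valid UTF-8, so it was probably fine to begin with.
        return text


if __name__ == "__main__":
    # Encoding Python uses when printing to the terminal.
    print("stdout encoding:", sys.stdout.encoding)
    # ascii() shows the code points regardless of how the terminal renders them.
    print(ascii("Entité"))
```

If `ascii()` on a value from the DataFrame shows `\xe9` for the é, the data was read correctly and only the display is at fault. If it shows `\xc3\xa9`, the string itself is damaged and the helper above can repair it.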

Serge Ballesta