1

I am trying to import a CSV file that contains Chinese characters.

This command downloads the CSV file:

!wget -O wm.csv https://raw.githubusercontent.com/hierarchyJK/compare-LIBSVM-with-Linear-and-Gassian-Kernel/master/%E8%A5%BF%E7%93%9C3.0.csv

The repository is not mine, so I am not sure if it is encoded the right way.

What I am sure of is that it renders correctly.

This code

pd.read_csv('wm.csv', encoding='utf-8')

raises this error:

'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte

I searched for this error but did not find an appropriate root cause analysis or solution.

This code executes without error

pd.read_csv('wm.csv', encoding='cp1252')

but the output is garbled.

The system itself renders Chinese characters correctly.

The same happens with Python's built-in open:

with open('wm.csv', 'r', encoding='cp1252') as f:
    for line in f.readlines():
        print(line)
        break

This code prints garbled text without any warning or error:

±àºÅ,É«Ôó,¸ùµÙ,ÇÃÉù,ÎÆÀí,Æê²¿,´¥¸Ð,ÃܶÈ,º¬ÌÇÂÊ,ºÃ¹Ï,Ðò¹ØÏµ
JJJohn

3 Answers

1

The encoding is 'GB18030'. I found this by opening the file in a text editor and checking the suggested encoding. GitHub also shows you the encoding when you go to the GitHub link and click on "edit file".
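
If no editor is at hand, the same check can be approximated in code. The following is a minimal sketch, not a definitive detector: `first_strict_decoding` is a hypothetical helper name, and the candidate list is an assumption you would adapt to your data. It returns the first encoding that decodes the bytes without raising an error.

```python
from typing import Iterable, Optional


def first_strict_decoding(raw: bytes, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper for illustration; it only proves that decoding
    succeeds, not that the decoded text is meaningful.
    """
    for name in candidates:
        try:
            raw.decode(name)  # bytes.decode is strict by default
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in for the file contents; in practice: raw = open('wm.csv', 'rb').read()
sample = "编号,色泽".encode("gb18030")

# UTF-8 is rejected (0xb1 is not a valid start byte), GB18030 is accepted.
print(first_strict_decoding(sample, ["utf-8", "gb18030"]))
```

Single-byte codecs such as latin_1 accept every byte value, so they decode "successfully" into garbage and should not be placed in the candidate list.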

NickHilton
1

You should use encoding="GBK". Hope this helps.

df = pd.read_csv('wm.csv', encoding="GBK")

For more details, check HERE.
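
To see why GBK fits, the garbled header from the question can be reversed: re-encoding the text with cp1252 restores the original bytes, and decoding those bytes as GBK gives readable Chinese. This is a small sketch for illustration; `undo_mojibake` is a hypothetical helper name, and the sample string is the first field of the garbled line in the question.

```python
def undo_mojibake(text: str, wrong: str = "cp1252", right: str = "gbk") -> str:
    """Reverse a wrong decode: recover the raw bytes, then decode them properly.

    Hypothetical helper; it only works when every original byte survived
    the wrong decode, which holds for the sample below.
    """
    raw = text.encode(wrong)  # back to the bytes stored in the file
    return raw.decode(right)  # decode with the encoding the file really uses


garbled = "±àºÅ"  # first field of the garbled header line
print(garbled.encode("cp1252"))  # the bytes start with 0xb1, the byte UTF-8 rejected
print(undo_mojibake(garbled))
```

GB18030 is a superset of GBK, so passing either name to `pd.read_csv` should read this file the same way.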

R.A.Munna
0

Here is a link with all of the standard encodings. Latin_1 has worked well for me when I have had issues, but in your case you can try utf_16_be. Good luck!

Standard Encodings

JJSSEE