Problems with UTF-8 in Python

Question

My code is below.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import codecs

df1 = pd.read_csv(r'E:\내논문자료\wordcloud\test1\1311_1312.csv',encoding='utf-8')

df2 = df1.groupby(['address']).size().reset_index()
df2.rename(columns = {0: 'frequency'}, inplace = True)
print(df2[:100])

But when I execute this code, I get this message:

Traceback (most recent call last):
File "E:/빅데이터 캠퍼스/untitled1/groupby freq.py", line 7, in <module>
df1 = pd.read_csv(r'E:\내논문자료\wordcloud\test1\1311_1312.csv',encoding='utf-8')
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 645, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 400, in _read
data = parser.read()
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 938, in read
ret = self._engine.read(nrows)
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 1507, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 846, in pandas.parser.TextReader.read (pandas\parser.c:10364)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10640)
File "pandas\parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas\parser.c:11677)
File "pandas\parser.pyx", line 1047, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:13111)
File "pandas\parser.pyx", line 1106, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:14065)
File "pandas\parser.pyx", line 1204, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:16121)
File "pandas\parser.pyx", line 1220, in pandas.parser.TextReader._string_convert (pandas\parser.c:16349)
File "pandas\parser.pyx", line 1452, in pandas.parser._string_box_utf8 (pandas\parser.c:22014)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte

How can I solve it? Should I alter the parser code in pandas?

It says that the utf-8 codec cannot decode byte 0xbc. Are you sure the file is UTF-8? You can use chardet if you really don't know which encoding is used. Can you post the beginning of your file? - Guillaume Jacquenot, Dec 11 '16 at 09:15
Your input data is not UTF-8. Either fix your CSV file to be UTF-8, or figure out the correct codec and use that instead. We can't see your source data, however, so we can't help here. - Martijn Pieters, Dec 11 '16 at 09:20
The character you cannot decode is ¼, which is encoded as 0xbc in an ASCII-compatible single-byte encoding such as Latin-1. You should change your encoding to one of those. - Guillaume Jacquenot, Dec 11 '16 at 09:50
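To make the comments above concrete, here is a minimal sketch of why byte 0xbc trips the utf-8 codec and how the same byte reads under other codecs. The two-byte sequence `pair` is a hypothetical example, not taken from the asker's file, and EUC-KR is only an assumption suggested by the Korean file paths in the question.

```python
# Byte 0xbc is 0b10111100: in UTF-8 that bit pattern is a continuation byte,
# so it can never be the first byte of a character.
raw = b"\xbc"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 failed:", exc.reason)

# In a single-byte codec such as Latin-1, every byte maps to one character.
latin1_char = raw.decode("latin-1")
print(repr(latin1_char))

# Assumption: in a multi-byte codec such as EUC-KR, 0xbc can instead be the
# lead byte of a two-byte character. This pair is a made-up illustration.
pair = b"\xbc\xad"
euckr_text = pair.decode("euc-kr")
print(repr(euckr_text))
```

So a successful decode under Latin-1 does not prove the file is Latin-1; it only shows that the byte is legal there. The decoded text still has to look right.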

score 1 · Answer 1 · edited May 23 '17 at 11:45

It looks like your source data isn't encoded as UTF-8; it was probably written with one of the other codecs. Per this answer, you might want to try encoding='GBK' to start with, or encoding='gb2312'.
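Trying codecs one at a time can be scripted. The sketch below is only an illustration: `first_working_encoding` is a hypothetical helper, `sample` stands in for the first bytes of the real CSV, and putting `cp949` in the candidate list is an assumption based on the Korean file paths in the question rather than anything confirmed about the data.

```python
def first_working_encoding(raw, candidates):
    """Return the first codec name in candidates that decodes raw without
    raising UnicodeDecodeError, or None if every candidate fails."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample standing in for the start of the CSV file.
sample = "address\n서울\n".encode("cp949")

# Order matters: the first codec that decodes cleanly wins.
candidates = ["utf-8", "cp949", "gbk", "gb2312"]
detected = first_working_encoding(sample, candidates)
print(detected)
```

With a real file you would read the bytes with `open(path, 'rb').read()` and then pass the result on as `pd.read_csv(path, encoding=detected)`. Keep in mind that several double-byte codecs accept the same byte sequences, so a clean decode is not proof of the right codec; print a few rows and check that the text is readable.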

answered Dec 11 '16 at 09:14

Withnail
