
I am trying to perform text analysis on Chinese texts. The program is provided below. The result contains unreadable characters such as 浜烘皯鏃ユ姤绀捐, but if I change the output file from result.csv to result.txt, the characters display correctly as 人民日报社论. What is wrong here? I cannot figure it out. I have tried several things, including adding decoders and encoders.

    # -*- coding: utf-8 -*-
    import os
    import glob
    import jieba
    import jieba.analyse
    import csv
    import codecs  

    segList = []
    raw_data_path = 'monthly_raw_data/'
    file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]

    jieba.load_userdict("customized_dict.txt")

    for name in file_name:
        all_text = ""
        multi_line_text = ""
        with open(raw_data_path + name + ".txt", "r") as file:
            for line in file:
                if line != '\n':
                    multi_line_text += line
            templist = multi_line_text.split('\n')
            for text in templist:
                all_text += text
            seg_list = jieba.cut(all_text, cut_all=False)
            temp_text = []
            for item in seg_list:
                temp_text.append(item.encode('utf-8'))

            stop_list = []
            with open("stopwords.txt", "r") as stoplistfile:
                for item in stoplistfile:
                    stop_list.append(item.rstrip('\r\n'))

            text_without_stopwords = []
            for word in temp_text:
                if word not in stop_list:
                    text_without_stopwords.append(word)

            segList.append(text_without_stopwords)


    with open("results/result.csv", 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(segList)
  • How did you determine that the characters are "unreadable"? Do you open the csv file with Excel, look at it with a command-line tool like `less`, or open it with a text editor? – Clemens Klein-Robbenhaar Dec 27 '15 at 15:15
  • Yes, I open it with Excel. If I rename `result.csv` to `result.txt`, I can read all the characters. It is very strange. – flyingmouse Dec 27 '15 at 15:21
  • Excel has an issue where it mangles special characters. Try opening result.csv in Notepad++, for example, and see if it's correct. – Untitled123 Dec 27 '15 at 15:29
  • I guess the characters are written properly, but Excel reads them wrong; by default it assumes the `cp-1252` encoding. I don't have Excel at hand, but can you look for an option to set the encoding in Excel's `Open file...` dialog? (Or maybe it is called "Import Data" or the like.) – Clemens Klein-Robbenhaar Dec 27 '15 at 15:30
  • I've also heard of converting from the xlsx format to a csv causing this. The file itself should be fine; it seems to be an Excel issue that is not easily solved. – Untitled123 Dec 27 '15 at 16:09
  • @Untitled123, if I open it with Notepad, it is correct. – flyingmouse Dec 28 '15 at 02:59
  • Yeah, it's just an Excel issue; your files are safe. – Untitled123 Dec 28 '15 at 03:00
  • @ClemensKlein-Robbenhaar, there is no option like that. – flyingmouse Dec 28 '15 at 03:00
  • @Untitled123, so it is difficult to solve. I think I will just output all files as .txt... – flyingmouse Dec 28 '15 at 03:00
  • Technically, the csv is not the issue; rather, Excel is the most common program used to open csv files. From your program's perspective, it really does not matter. – Untitled123 Dec 28 '15 at 03:01

2 Answers


For UTF-8 encoding, Excel requires the BOM (byte order mark) codepoint U+FEFF written at the start of the file; otherwise it assumes an ANSI encoding, which is locale-dependent. Here's an example that will open correctly in Excel:

#!python2
#coding:utf8
import csv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

with open('results.csv', 'wb') as f:
    f.write(u'\ufeff'.encode('utf8'))  # write the BOM so Excel detects UTF-8
    w = csv.writer(f)
    for row in data:
        # Python 2's csv module wants byte strings, so encode each cell
        w.writerow([item.encode('utf8') for item in row])
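
Incidentally, the specific garbage in the question is diagnostic: on a Chinese-locale Windows the ANSI codepage is cp936 (GBK), so Excel decodes the UTF-8 bytes as GBK. A minimal check (assuming a console that can display Chinese):

#!python2
#coding:utf8
# UTF-8 bytes misread as GBK reproduce the question's mojibake exactly
print u'人民日报'.encode('utf8').decode('gbk')  # prints 浜烘皯鏃ユ姤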

Python 3 makes this easier. Open the file with `'w', newline='', encoding='utf-8-sig'` instead of `'wb'`; the csv writer then accepts Unicode strings directly, and the `utf-8-sig` codec writes the BOM automatically:

#!python3
#coding:utf8
import csv

data = [['American', '美国人'],
        ['Chinese', '中国人']]

# newline='' is required by the csv module; utf-8-sig writes the BOM for us
with open('results.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    w.writerows(data)
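
When reading the file back in Python 3, the same codec strips the BOM transparently (a short sketch, assuming the file written above):

#!python3
#coding:utf8
import csv

# utf-8-sig strips the BOM on read, so the first cell comes back as
# 'American' rather than '\ufeffAmerican'
with open('results.csv', newline='', encoding='utf-8-sig') as f:
    for row in csv.reader(f):
        print(row)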

There is also a third-party unicodecsv module that makes Python 2 easier to use as well:

#!python2
#coding:utf8
import unicodecsv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

with open('results.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='utf-8-sig')  # handles encoding and BOM
    w.writerows(data)
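
Applied to the question's script, the fix is one extra line before creating the writer (a sketch reusing the question's `segList`, whose items are already UTF-8 byte strings):

#!python2
#coding:utf8
import csv

# segList is built by the question's script; its items are already
# UTF-8-encoded byte strings, so no further encoding is needed here.
with open("results/result.csv", 'wb') as f:
    f.write(u'\ufeff'.encode('utf8'))  # BOM so Excel detects UTF-8
    writer = csv.writer(f)
    writer.writerows(segList)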
  • I got an error using your first code (Python 2.x): `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)` – flyingmouse Dec 28 '15 at 02:57
  • @flyingmouse, make sure you are encoding Unicode strings and not byte strings. If you call `.encode('utf8')` on a byte string, Python 2 will do an implicit `.decode('ascii')` to try to turn it into a Unicode string first. – Mark Tolonen Dec 28 '15 at 05:54

Here is another way, somewhat tricky:

#!python2
#coding:utf8
import csv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

with open('results.csv', 'wb') as f:
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('utf8') for item in row])

This code block generates a UTF-8 encoded csv file. Then:

  1. Open the file with Notepad++ (or another editor with encoding support)
  2. Encoding -> Convert to ANSI
  3. Save

Open the file with Excel; it displays correctly.
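
The Notepad++ step can also be done in Python itself by encoding to GBK (the ANSI codepage on a Chinese-locale Windows) instead of UTF-8; a sketch, with the caveat that it fails for any character not representable in GBK:

#!python2
#coding:utf8
import csv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

# Encode each cell to GBK (cp936) so Excel's ANSI fallback decodes it
# correctly on a Chinese-locale Windows; no post-hoc conversion needed.
with open('results.csv', 'wb') as f:
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('gbk') for item in row])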
