
I am using PySpark to do a word count. There is a lot of Chinese text. I want to save the result to a file, but saveAsTextFile() does not write the Chinese characters correctly. Here is my code:

from pyspark import SparkContext

sc = SparkContext()
# read all files in the directory as (path, content) pairs
dir_path = '/Users/vera/learn/data_mining/caoz'
file_rdd = sc.wholeTextFiles(dir_path, use_unicode=True)
counts = file_rdd.map(lambda (k, v): v) \
        .flatMap(lambda line: line.split('\n')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda a: -a[1])
counts.saveAsTextFile('counts')

The output is escape sequences like '\x**' rather than the Chinese characters themselves. I tried encode and decode, but neither works. So I'd like to know how to deal with this, or whether saveAsTextFile() simply cannot handle Chinese characters.

vera

1 Answer

counts.map(lambda x: x[0] + " " + str(x[1])).saveAsTextFile('counts')
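A note on why this works (my reading, not stated in the answer): saveAsTextFile() writes str(element) for each record, and str() of a tuple is its repr(), which under Python 2 escapes non-ASCII text (e.g. u'\u4e2d\u6587'). Mapping each (word, count) pair to a single string first means the tuple repr() is never taken. A minimal plain-Python sketch of the same transformation, using made-up sample pairs:

```python
# Sample (word, count) pairs such as the RDD would hold.
pairs = [(u"中文", 3), (u"你好", 2)]

# What saveAsTextFile effectively writes without the fix: the tuple's
# repr(), which escapes non-ASCII characters under Python 2.
# Python 3's ascii() shows the same escaped form:
print(ascii(pairs[0]))   # ('\u4e2d\u6587', 3)

# The answer's map(): join word and count into one string, so the
# characters survive intact in the output file.
lines = [p[0] + u" " + str(p[1]) for p in pairs]
print(lines[0])          # 中文 3
```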
Zhang Tong