I am using PySpark to do a word count over files that contain many Chinese words. I want to save the result to a file, but saveAsTextFile() does not write the Chinese characters correctly. Here is my code:
from pyspark import SparkContext

sc = SparkContext()
# read all files in the directory as (path, content) pairs
dir_path = '/Users/vera/learn/data_mining/caoz'
file_rdd = sc.wholeTextFiles(dir_path, use_unicode=True)
counts = file_rdd.map(lambda kv: kv[1]).\
    flatMap(lambda content: content.split('\n')).\
    map(lambda word: (word, 1)).\
    reduceByKey(lambda a, b: a + b).\
    sortBy(lambda pair: -pair[1])
counts.saveAsTextFile('counts')
The output contains escape sequences like '\x**' instead of the Chinese characters. I tried encode and decode, but neither works. So I'd like to know how to deal with this, or whether saveAsTextFile() simply cannot handle Chinese characters.
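For what it's worth, here is a minimal demo (plain Python, no Spark) of what I suspect is happening: saveAsTextFile() converts each RDD element with str(), and str() of a tuple uses repr() on its members, which shows byte strings as hex escapes. The pair value and the tab-separated format below are just illustrative, not from my real data.

```python
# A word-count pair where the word is a UTF-8 byte string.
pair = ('中文'.encode('utf-8'), 3)  # hypothetical (word, count) pair

# What saveAsTextFile effectively writes: str() of the tuple,
# which repr()s the byte string as \x.. escapes.
line = str(pair)
print(line)  # shows \xe4\xb8\xad... escapes, not 中文

# Formatting the pair into a plain text string myself keeps
# the actual characters instead of the escaped repr.
word, count = pair
fixed = '%s\t%d' % (word.decode('utf-8'), count)
print(fixed)  # 中文	3
```

Is mapping each pair to a formatted string like this before calling saveAsTextFile() the right approach, or is there a proper way to make Spark write the characters directly?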