
I am using PySpark to do a word count. There is a lot of Chinese text. I want to save the result to a file, but saveAsTextFile() does not write the Chinese characters correctly. Here is my code:

from pyspark import SparkContext

sc = SparkContext()
# read all files in the directory as (path, content) pairs
dir_path = '/Users/vera/learn/data_mining/caoz'
file_rdd = sc.wholeTextFiles(dir_path, use_unicode=True)
counts = file_rdd.map(lambda (k, v): v) \
        .flatMap(lambda line: line.split('\n')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda a: -a[1])
counts.saveAsTextFile('counts')

The output is escape sequences like '\x**' rather than the Chinese characters themselves. I tried encode and decode, but neither works. So I'd like to know how to deal with this, or whether saveAsTextFile() simply cannot handle Chinese characters.

vera

1 Answer

counts.map(lambda x: x[0] + " " + str(x[1])).saveAsTextFile('counts')
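A note on why this works (my reading, not stated in the answer): saveAsTextFile() writes str(element) for each record, and str() of a tuple is its repr(), which under Python 2 escapes non-ASCII text (e.g. u'\u4e2d\u6587'). Mapping each (word, count) pair to a single string first means the tuple repr() is never taken. A minimal plain-Python sketch of the same transformation, using made-up sample pairs:

```python
# Sample (word, count) pairs such as the RDD would hold.
pairs = [(u"中文", 3), (u"你好", 2)]

# What saveAsTextFile effectively writes without the fix: the tuple's
# repr(), which escapes non-ASCII characters under Python 2.
# Python 3's ascii() shows the same escaped form:
print(ascii(pairs[0]))   # ('\u4e2d\u6587', 3)

# The answer's map(): join word and count into one string, so the
# characters survive intact in the output file.
lines = [p[0] + u" " + str(p[1]) for p in pairs]
print(lines[0])          # 中文 3
```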
Zhang Tong