I have many large HDFS files encoded in GBK, and they contain special characters, including Chinese. The Chinese strings need to be displayed and saved to a file. How can I handle this?
As far as I can tell, PySpark's text reader only supports UTF-8.
- Spark version: 2.0.0
- Hadoop version: 2.7
- Python version: 2.7
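What I tried (a minimal sketch; the path is hypothetical): reading with `sc.textFile` decodes the bytes as UTF-8, so any GBK byte sequence that is not valid UTF-8 comes back as the replacement character.

```python
# sc.textFile assumes UTF-8, so GBK bytes that are not valid UTF-8
# are replaced with U+FFFD. The path is just an example.
rdd = sc.textFile("hdfs:///data/gbk/part-00000")
print(rdd.first())  # prints garbage like u'\ufffd\u0439\ufffd'
```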
Edit:
The result will be saved to a file, and that file is then consumed by another system (an SDK, for example). When I print a single word, I get something like u'\ufffd\u0439\ufffd'; \ufffd is the Unicode replacement character, so the text is obviously corrupted.
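One workaround I am considering is to bypass PySpark's UTF-8 text decoding entirely: read each file as raw bytes with `sc.binaryFiles`, decode it as GBK in Python, and re-encode as UTF-8 before saving. This is only a sketch with hypothetical paths, and `binaryFiles` materializes one whole file per record, so it may not suit files larger than executor memory. Is this a reasonable approach, or is there a better one?

```python
# -*- coding: utf-8 -*-
from pyspark import SparkContext

sc = SparkContext(appName="gbk-to-utf8-sketch")

# binaryFiles yields (path, raw_bytes) pairs; decode each whole file
# as GBK in Python, then split it into lines. Paths are hypothetical.
lines = (sc.binaryFiles("hdfs:///data/gbk/")
           .flatMap(lambda kv: kv[1].decode("gbk").splitlines()))

# Re-encode as UTF-8 so the output is valid text for downstream systems.
lines.map(lambda line: line.encode("utf-8")) \
     .saveAsTextFile("hdfs:///data/utf8-out/")
```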