
I have many large HDFS files encoded in GBK, and they contain Chinese and other non-ASCII characters. These Chinese strings need to be displayed and saved to a file. How can I handle this?

PySpark's text reader only supports UTF-8.

  • Spark version: 2.0.0
  • Hadoop version: 2.7
  • Python version: 2.7
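
Since the built-in reader assumes UTF-8, one workaround is to read the raw bytes with `sc.binaryFiles` and decode them explicitly. This is a minimal sketch, not a confirmed recipe; the HDFS path is a placeholder:

```python
# -*- coding: utf-8 -*-
from pyspark import SparkContext

sc = SparkContext(appName="read-gbk")

# sc.textFile decodes with UTF-8, which turns GBK bytes into u'\ufffd'
# replacement characters. Reading the raw bytes and decoding manually
# avoids that. Caveat: binaryFiles yields one record per file, so each
# file must fit in executor memory.
lines = (sc.binaryFiles("hdfs:///path/to/gbk/dir")    # (path, bytes) pairs
           .values()                                  # keep the raw bytes
           .map(lambda raw: raw.decode("gbk"))        # bytes -> unicode
           .flatMap(lambda text: text.splitlines()))  # one record per line
```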

Edit:

The result will be saved to a file, and that file is then consumed by another system (an SDK, for example). When I print a word I get something like u'\ufffd\u0439\ufffd', which is obviously invalid: \ufffd is the Unicode replacement character, so the bytes were mangled during decoding.
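
Assuming the decoding sketch above (reusing its `lines` RDD), the result can be re-encoded as UTF-8 before saving so the downstream system receives valid bytes; the output path is a placeholder:

```python
# In Python 2, saveAsTextFile writes unicode as UTF-8 by default, but
# encoding explicitly makes the intent clear.
(lines.map(lambda line: line.encode("utf-8"))
      .saveAsTextFile("hdfs:///path/to/utf8/output"))
```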

MartinGau
  • Do you want to remove them or display them? You can display them as Unicode characters if you want, something like this: `u'\u8bf7\u8f93'` – philantrovert Jun 09 '17 at 08:49
  • The result will be saved to a file, then the result file will be used in another system, an SDK for example. I printed one word and got u'\ufffd\u0439\ufffd', which is obviously invalid. – MartinGau Jun 11 '17 at 04:15
  • Problem solved as follows: create a temporary Hive table encoded in UTF-8 (see the sketch after this comment thread). – MartinGau Jul 31 '17 at 01:57
  • Same problem here. Could you share the solution in more detail? @MartinGau – fishiwhj Jan 04 '18 at 07:46
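
The fix in MartinGau's comment is only sketched, so the following is one plausible reading rather than his confirmed method: declare a Hive table over the GBK files with LazySimpleSerDe's `serialization.encoding` property so Hive decodes the bytes on read, then copy the rows into a table with Hive's default (UTF-8) encoding. All table names and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table and path names. LazySimpleSerDe honors the
# 'serialization.encoding' property, so Hive decodes GBK on read.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS gbk_raw (line STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES ('serialization.encoding' = 'GBK')
    LOCATION 'hdfs:///path/to/gbk/dir'
""")

# Copying into a table with the default encoding (UTF-8) produces
# files that another system can read directly.
spark.sql("CREATE TABLE IF NOT EXISTS utf8_clean AS SELECT line FROM gbk_raw")
```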

0 Answers