I have many large HDFS files encoded in GBK, and they contain special characters, including Chinese. The Chinese strings need to be displayed and saved to a file. How can I handle this?
As far as I can tell, PySpark's text reader only supports UTF-8.
- Spark version: 2.0.0
- Hadoop version: 2.7
- Python version: 2.7
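What I tried (a minimal sketch; the path is hypothetical): reading with `sc.textFile` decodes the bytes as UTF-8, so any GBK byte sequence that is not valid UTF-8 comes back as the replacement character.

```python
# sc.textFile assumes UTF-8, so GBK bytes that are not valid UTF-8
# are replaced with U+FFFD. The path is just an example.
rdd = sc.textFile("hdfs:///data/gbk/part-00000")
print(rdd.first())  # prints garbage like u'\ufffd\u0439\ufffd'
```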
Edit:
The result will be saved to a file, and that file is then consumed by another system (an SDK, for example). When I print a single word, I get something like u'\ufffd\u0439\ufffd'; \ufffd is the Unicode replacement character, so the text is obviously corrupted.
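One workaround I am considering is to bypass PySpark's UTF-8 text decoding entirely: read each file as raw bytes with `sc.binaryFiles`, decode it as GBK in Python, and re-encode as UTF-8 before saving. This is only a sketch with hypothetical paths, and `binaryFiles` materializes one whole file per record, so it may not suit files larger than executor memory. Is this a reasonable approach, or is there a better one?

```python
# -*- coding: utf-8 -*-
from pyspark import SparkContext

sc = SparkContext(appName="gbk-to-utf8-sketch")

# binaryFiles yields (path, raw_bytes) pairs; decode each whole file
# as GBK in Python, then split it into lines. Paths are hypothetical.
lines = (sc.binaryFiles("hdfs:///data/gbk/")
           .flatMap(lambda kv: kv[1].decode("gbk").splitlines()))

# Re-encode as UTF-8 so the output is valid text for downstream systems.
lines.map(lambda line: line.encode("utf-8")) \
     .saveAsTextFile("hdfs:///data/utf8-out/")
```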