In my Spark job (Spark 2.4.1) I am reading CSV files from S3. These files contain Japanese characters. They can also contain the ^M character (\u000D), so I need to parse them as multiline records.
First I used the following code to read the CSV files:
import org.apache.spark.sql.{DataFrame, DataFrameReader}
import org.apache.spark.sql.types.StructType

implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    // \u0001 is the field delimiter used in these Teradata exports
    dataFrameReader.option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .schema(schema)
      .csv(s3Path)
  }
}
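For context, the reader is invoked roughly like this; the schema and S3 path below are placeholders for illustration, not the real values:

import org.apache.spark.sql.types.{StructField, StringType, StructType}

// Hypothetical schema and path, for illustration only
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("name_jp", StringType)
))
val df = spark.read.readTeradataCSV(schema, "s3://my-bucket/teradata-export/")
df.show(5, truncate = false)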
But when I read the DataFrame using this method, all the Japanese characters are garbled.
After doing some tests I found out that if I read the same S3 file using "spark.sparkContext.textFile(path)", the Japanese characters are decoded properly.
So I tried it this way:
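A minimal sketch of that check (the path is a placeholder):

// Print a few raw lines read via textFile to inspect the Japanese characters
spark.sparkContext.textFile("s3://my-bucket/teradata-export/part-000")
  .take(5)
  .foreach(println)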
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

implicit class SparkSessionImplicits(spark: SparkSession) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    import spark.sqlContext.implicits._

    // Read the raw text first, replace CR (^M) characters with spaces,
    // then parse the resulting Dataset[String] as CSV.
    spark.read.option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .schema(schema)
      .csv(spark.sparkContext.textFile(s3Path).map(str => str.replaceAll("\u000D", " ")).toDS())
  }
}
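With this version the method hangs off the SparkSession itself, so the call looks roughly like this (reusing the placeholder schema and path from above):

val df = spark.readTeradataCSV(schema, "s3://my-bucket/teradata-export/")
df.show(5, truncate = false)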
Now the encoding issue is fixed. However, multiline records are not handled properly and lines are broken near the ^M character, even though I tried to replace ^M using str.replaceAll("\u000D", " ").
Any tips on how to read Japanese characters correctly using the first method, or how to handle multiline records using the second method?
UPDATE: This encoding issue only happens when the app runs on the Spark cluster. When I ran the app locally, reading the same S3 file, the encoding worked just fine.
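One thing I have not ruled out is a difference in the JVM default charset between my local run and the executors on the cluster. A quick sketch of that check (assuming spark is the active SparkSession):

import java.nio.charset.Charset
import spark.implicits._

// Default charset on the driver JVM
println(s"driver default charset: ${Charset.defaultCharset()}")

// Default charset as seen inside the executor JVMs
spark.range(1)
  .map(_ => Charset.defaultCharset().toString)
  .show(false)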