I have a Spark job that reads CSV files from S3, processes them, and saves the result as Parquet files. These CSVs contain Japanese text.
When I run the job locally, reading the S3 CSV file and writing the Parquet files to a local folder, the Japanese characters look fine.
But when I run it on my Spark cluster, reading the same S3 CSV file and writing the Parquet output to HDFS, all the Japanese characters are garbled.
Run on the Spark cluster (data is garbled):
spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
Run locally (data looks fine):
spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.
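Since the only real difference between the two runs is the environment the job executes in, one thing worth checking is whether the driver and executor JVMs agree on their default charset. Below is a minimal diagnostic sketch, assuming a SparkSession named spark is in scope (the RDD and partition count are only there for illustration):

import java.nio.charset.Charset

// Default charset of the driver JVM
println(s"driver: file.encoding=${System.getProperty("file.encoding")}, defaultCharset=${Charset.defaultCharset()}")

// Distinct default charsets reported by the executor JVMs
val executorCharsets = spark.sparkContext
  .parallelize(1 to 8, 4)
  .map(_ => s"file.encoding=${System.getProperty("file.encoding")}, defaultCharset=${Charset.defaultCharset()}")
  .distinct()
  .collect()

executorCharsets.foreach(s => println(s"executor: $s"))

If the values differ (for example ANSI_X3.4-1968 on the cluster nodes), forcing UTF-8 with -Dfile.encoding=UTF-8 in both spark.driver.extraJavaOptions and spark.executor.extraJavaOptions would be one thing to try.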
Reading CSV:
def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader
    .option("delimiter", "\u0001")   // fields are delimited by the \u0001 (SOH) control character
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")      // "charset" is an alias of "encoding"
    .schema(schema)
    .csv(path)
}
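For illustration, this is one way to exercise the reader above and inspect the raw bytes; the schema, column name, and path here are hypothetical, and the byte dump is only there to check whether the text is already garbled right after the CSV read, before Parquet is involved at all:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema for the exported table
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name_jp", StringType, nullable = true)
))

val df = readTeradataCSV(schema, "s3a://bucket/export/e2e-test/table_base_TEST")

// Print a few values together with their raw UTF-8 bytes
df.select("name_jp").limit(5).collect().foreach { row =>
  Option(row.getString(0)).foreach { s =>
    println(s + "  bytes=" + s.getBytes("UTF-8").map("%02X".format(_)).mkString(" "))
  }
}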
This is how I write to Parquet:
finalDf.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .option("encoding", "UTF-8")   // Parquet stores strings as UTF-8 by spec
  .option("charset", "UTF-8")
  .partitionBy(parCols: _*)
  .save()
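As a sanity check under the same assumptions (hypothetical column name, hdfsTablePath as above, SparkSession spark in scope), the written Parquet can be read back and compared with what the CSV read returned, to narrow down whether the corruption happens on read or on write:

// Read the written Parquet back and inspect the same column; if the text is
// already garbled immediately after the CSV read, the Parquet write is not the culprit.
val checkDf = spark.read.parquet(hdfsTablePath)
checkDf.select("name_jp").limit(5).show(truncate = false)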
This is how the data looks on HDFS: the Japanese text shows up as garbled characters.
Any tips on how to fix this?
Does the input CSV file have to be in UTF-8 encoding?
**Update** Found out it's not related to Parquet, but rather to the CSV loading. Asked a separate question here:
Spark CSV reader: garbled Japanese text and handling multilines