
I have a Spark job that reads CSV files from S3, processes them, and saves the result as Parquet files. These CSVs contain Japanese text.

When I run the job locally, reading the S3 CSV file and writing the Parquet files to a local folder, the Japanese characters look fine.

But when I run it on my Spark cluster, reading the same S3 CSV file and writing the Parquet to HDFS, all the Japanese characters are garbled.

Run on the Spark cluster (data is garbled):

spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap=  -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar

Run locally (data looks fine):

spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap=  -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar

As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.

Reading CSV:

def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader.option("delimiter", "\u0001")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")
    .schema(schema)
    .csv(path)
}
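
For context, `dataFrameReader` above is presumably `spark.read` on the active SparkSession. A minimal sketch of how this reader might be wired up and called -- the app name, schema and path below are placeholders, not the real job's values:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("Teradata_export_test").getOrCreate()
val dataFrameReader = spark.read   // assumed definition of `dataFrameReader` used above

// Placeholder schema -- the real job builds this from Teradata metadata.
val schema = StructType(Seq(
  StructField("time_stamp", StringType, nullable = true),
  StructField("description_jp", StringType, nullable = true)
))

val df: DataFrame = readTeradataCSV(schema, "s3a://bucket/export/e2e-test/table_base_TEST")
df.show(5, truncate = false)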

This is how I write to parquet:

finalDf.write
      .format("parquet")
      .mode(SaveMode.Append)
      .option("path", hdfsTablePath)
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .partitionBy(parCols: _*)
      .save()

This is how the data on HDFS looks (screenshot of the garbled Japanese text omitted):
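
To rule out a terminal or HDFS file-browser rendering issue, a rough check is to print the Unicode code points of a sample value, both from the DataFrame before the write and after reading the Parquet back (the column name `description_jp` is just a placeholder):

// Debugging sketch: dump code points so font/rendering problems cannot hide or cause the garbling.
def dumpCodePoints(df: org.apache.spark.sql.DataFrame, colName: String): Unit = {
  df.select(colName).na.drop().limit(3).collect().foreach { row =>
    val s = row.getString(0)
    println(s + "  ->  " + s.map(c => f"U+${c.toInt}%04X").mkString(" "))
  }
}

dumpCodePoints(finalDf, "description_jp")                           // before the write
dumpCodePoints(spark.read.parquet(hdfsTablePath), "description_jp") // after reading the Parquet back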

Any tips on how to fix this?

Does the input CSV file have to be in UTF-8 encoding?
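
One way to check that is to pull a few raw bytes of the object and decode them both as UTF-8 and with the JVM's default charset; if the two renderings differ, the file encoding (or the default charset) is the culprit. A rough sketch, where the object key is a placeholder under the job's `s3BasePath`:

import java.nio.charset.{Charset, StandardCharsets}

// Grab one file's bytes (placeholder object key) and compare decodings.
val (_, stream) = spark.sparkContext
  .binaryFiles("s3a://bucket/export/e2e-test/part-000.csv")
  .first()
val head: Array[Byte] = stream.toArray().take(200)

println(s"JVM default charset: ${Charset.defaultCharset()}")
println("Decoded as UTF-8:       " + new String(head, StandardCharsets.UTF_8))
println("Decoded as JVM default: " + new String(head, Charset.defaultCharset()))

Note that this only shows the driver's default charset; a similar check for the executors is sketched after the comments below.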

**Update**: Found out it's not related to Parquet but rather to CSV loading. I asked a separate question here:

Spark CSV reader : garbled Japanese text and handling multilines

Ashika Umanga Umagiliya
  • Hadoop libraries assume UTF-8 encoding by default. But pure Java libraries rely on the `file.encoding` system property by default -- that property depends on OS settings and hard-coded settings (and can be overridden only as a command-line parameter, before the JVM has started). – Samson Scharfrichter May 18 '20 at 09:02
  • Anyway, from the Spark source code (v3.0) for `CSVOptions` https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala the default encoding for CSV is UTF-8, but it can be overridden by either "encoding" or "charset" _(they are synonyms)_: `val charset = parameters.getOrElse("encoding", parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))` – Samson Scharfrichter May 18 '20 at 09:06
  • thank you, but we use Spark 2.4.1 – Ashika Umanga Umagiliya May 18 '20 at 09:08
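
Following up on the first comment above about `file.encoding`: a quick diagnostic (a sketch, not a confirmed fix for this issue) is to compare the JVM default charset seen by the driver and by the executors, since a non-UTF-8 OS locale on the cluster nodes would fit the symptoms:

// Print the file.encoding seen by the driver and by each executor host.
println("driver file.encoding = " + System.getProperty("file.encoding"))
spark.sparkContext
  .parallelize(1 to spark.sparkContext.defaultParallelism * 4)
  .map(_ => java.net.InetAddress.getLocalHost.getHostName -> System.getProperty("file.encoding"))
  .distinct()
  .collect()
  .foreach { case (host, enc) => println(s"executor $host file.encoding = $enc") }

If the executors report something other than UTF-8, it can be pinned at submit time by adding `-Dfile.encoding=UTF-8` to `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`.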

1 Answer


The Parquet format has no option for encoding or charset; cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala

Hence these options in your code have no effect:

finalDf.write
      .format("parquet")
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
(...)

These options apply only to CSV; you should set them (or rather ONE of them, since they are synonyms) when reading the source file.
This assumes you are using the Spark DataFrame API to read the CSV; otherwise you are on your own.
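
To make that concrete (just a sketch -- the charset value below is an example, and the right one depends on how the file was actually exported; Japanese exports are often Shift_JIS/MS932 or EUC-JP rather than UTF-8):

// Assumes `spark`, `schema` and `path` as in the question.
val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("multiLine", "true")
  .option("encoding", "Shift_JIS")   // must match the file's real encoding; "UTF-8" if it truly is UTF-8
  .schema(schema)
  .csv(path)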

Samson Scharfrichter
  • Recommended read about how Parquet manages "compression" of Strings: https://stackoverflow.com/questions/45488227/how-to-set-parquet-file-encoding-in-spark – Samson Scharfrichter May 18 '20 at 09:17
  • I tried reading the same S3 file using `spark.sparkContext.textFile(path)` and the encoding works fine! Not sure what's happening inside the CSV plugin. – Ashika Umanga Umagiliya May 18 '20 at 11:13
  • Seems the issue was not related to Parquet but rather to CSV loading. I asked a separate question here: https://stackoverflow.com/questions/61868668/spark-csv-reader-garbled-japanese-text-and-handling-multilines – Ashika Umanga Umagiliya May 18 '20 at 11:52