
Apache Spark supposedly supports Facebook's Zstandard compression algorithm as of Spark 2.3.0 (https://issues.apache.org/jira/browse/SPARK-19112), but I am unable to actually read a Zstandard-compressed file:

$ spark-shell

...

// Short name throws an exception
scala> val events = spark.read.option("compression", "zstd").json("data.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.

// Codec class can be imported
scala> import org.apache.spark.io.ZStdCompressionCodec
import org.apache.spark.io.ZStdCompressionCodec

// Fully-qualified codec class bypasses the error, but results in corrupt records
scala> spark.read.option("compression", "org.apache.spark.io.ZStdCompressionCodec").json("data.zst")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

What do I need to do in order to read such a file?

Environment is AWS EMR 5.14.0.


1 Answer

Per this comment, support for Zstandard in Spark 2.3.0 is limited to internal and shuffle outputs.
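
What Spark 2.3.0 does support is zstd for its own internal outputs (shuffle files, RDD blocks, broadcasts), selected via spark.io.compression.codec, where the short name works as expected:

$ spark-shell --conf spark.io.compression.codec=zstd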

Reading or writing Zstandard files uses Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which was introduced in Hadoop 2.9.0; EMR 5.14.0 ships Hadoop 2.8.3, which is why the codec is unavailable.
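
In other words, Spark's file-based sources delegate decompression to Hadoop, which resolves a codec by file extension. As a hedged sketch of what to expect once the cluster runs Hadoop 2.9.0 or later (not the case on EMR 5.14.0), you can ask Hadoop which codec it resolves for a .zst path; ZStandardCodec registers the .zst extension, so no "compression" option should be needed:

// Ask Hadoop which codec (if any) is registered for the .zst extension
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.hadoop.io.compress.CompressionCodecFactory
scala> val factory = new CompressionCodecFactory(spark.sparkContext.hadoopConfiguration)
scala> Option(factory.getCodec(new Path("data.zst"))).map(_.getClass.getName)

// On Hadoop >= 2.9.0 with native zstd support, the extension lookup
// means a plain read should work, with no "compression" option
scala> val events = spark.read.json("data.zst")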

  • I'm using Hadoop 3.2.2, but when trying to read a zstd file I get a java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support. Any ideas? Thanks – cnstlungu Apr 18 '21 at 20:37
  • Me too, @cnstlungu. I'm running Hadoop 2.10, and `hadoop checknative -a` shows *zstd: false*. Maybe the zstd license is not fully open and the Apache team decided to build without it? – Diego Scaravaggi Apr 21 '21 at 13:10
  • @DiegoScaravaggi here's how I sorted it out: https://stackoverflow.com/questions/67099204/reading-a-zst-archive-in-scala-spark-native-zstandard-library-not-available – cnstlungu Apr 22 '21 at 18:40
  • @cnstlungu, I think you are right, but I'm not using a 3.x data platform. On my 2.10, when I added the **native** library I got `org.apache.spark.sql.AnalysisException: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat`. For now I will postpone the native library, set up a test platform with **3.x**, and wait for the Apache Bigtop team's stable 1.6 build. – Diego Scaravaggi Apr 23 '21 at 07:39
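
For reference, the `checknative` result discussed in the comments above can also be reproduced from inside spark-shell; a minimal sketch, assuming a Hadoop 2.9.0+ classpath (the static check below is the source of the RuntimeException @cnstlungu quotes):

// Requires Hadoop >= 2.9.0 on the classpath
scala> import org.apache.hadoop.util.NativeCodeLoader
scala> import org.apache.hadoop.io.compress.ZStandardCodec

// libhadoop itself must be loadable...
scala> NativeCodeLoader.isNativeCodeLoaded

// ...and it must have been built with zstd support; otherwise this throws
// the "native zStandard library not available" RuntimeException
scala> ZStandardCodec.checkNativeCodeLoaded()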