
We have a limitation on the platform we are using: it ships Spark 2.4.7 with Hadoop 2.7.7 libraries underneath. We have data on S3 stored as Zstandard-compressed Parquet. Is there a way to write some kind of custom code to read this Zstandard Parquet in our job?

We don't have access to the infrastructure, so we cannot install anything extra on the machines. We can increase or decrease the executors (vertically and horizontally).

We have full control over the job code; that is what we are required to manage and submit to the platform, which then submits the code to Spark and executes it.

When we try to read the file using spark.read.parquet("file path"), we get this error:

    java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.ZStandardCodec
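For what it's worth, the missing class can be confirmed from the job code before the read is even attempted (a minimal sketch; the S3 path is a placeholder):

    // Minimal sketch: check whether Hadoop's Zstandard codec class is on the
    // job's classpath. On a Hadoop 2.7.7 classpath this lookup fails, which is
    // what surfaces as the ClassNotFoundException during the Parquet read.
    try {
      Class.forName("org.apache.hadoop.io.compress.ZStandardCodec")
      println("ZStandardCodec found on the classpath")
    } catch {
      case _: ClassNotFoundException =>
        println("ZStandardCodec NOT on the classpath (Hadoop < 2.9)")
    }

    // The read that triggers the error ("s3://bucket/path" is a placeholder):
    // val df = spark.read.parquet("s3://bucket/path")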

This is obviously expected. When we include the hadoop-common 2.9.1 dependency, which added the Zstandard codec, we get another error stating:

    this version of libhadoop was built without zstd support
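For reference, pulling that dependency in looks roughly like this (a build.sbt sketch; the coordinates are the standard Maven ones, the scope/packaging is an assumption):

    // build.sbt sketch: shipping a newer hadoop-common with the job jar. This
    // makes org.apache.hadoop.io.compress.ZStandardCodec resolvable, but the
    // codec still delegates to the cluster's native libhadoop, which here was
    // built without zstd support -- hence the runtime error above.
    libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.9.1"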

Is there a way to write a custom class to read the Zstandard-compressed Parquet into a Spark DataFrame?

FYI: I already checked some other SO questions and they did not cover my use case, especially given the restriction on infrastructure access.

  • Spark's support for the Zstandard compression algorithm wasn't added until Hadoop 2.9 and Spark 2.4.3 (with Hadoop 2.9 underneath). So with Spark 2.4.7 on Hadoop 2.7.7, and no ability to add anything to the infrastructure, it's difficult but hopefully not impossible. Hope someone provides a solution to this. – mamonu Jul 26 '23 at 12:54

1 Answer

  1. You can't mix hadoop-* JAR versions, any more than you can mix Spark ones.
  2. The Hadoop native libraries need to be in sync with the JARs too (see the sketch just after this list).
  3. Zstandard came with HADOOP-13578 ("Add Codec for ZStandard Compression"), which is in 2.9+ only.
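To see point 2 from inside a job, a minimal diagnostic sketch (assuming the job can call Hadoop classes directly; NativeCodeLoader.buildSupportsZstd() only exists from 2.9 on, so it is looked up reflectively):

    import org.apache.hadoop.util.NativeCodeLoader

    // Minimal diagnostic sketch: report whether native libhadoop is loaded at
    // all, and whether it claims zstd support. buildSupportsZstd() was added in
    // Hadoop 2.9 (HADOOP-13578), so it is looked up reflectively to keep this
    // compiling against Hadoop 2.7 client jars.
    println(s"libhadoop loaded: ${NativeCodeLoader.isNativeCodeLoaded()}")

    try {
      val m = classOf[NativeCodeLoader].getMethod("buildSupportsZstd")
      println(s"libhadoop built with zstd: ${m.invoke(null)}")
    } catch {
      case _: NoSuchMethodException =>
        println("buildSupportsZstd() not found: the Hadoop jars on the classpath are < 2.9")
      case e: java.lang.reflect.InvocationTargetException =>
        println(s"newer Hadoop jars present, but the native call failed: ${e.getCause}")
    }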

Accordingly, if you need Zstandard you need to be using a version of hadoop released in 2017 or later.

The only workarounds for this would be to:

  • backport the feature to your own private fork of hadoop-2.7, rebuild it and the native libraries, then deploy.
  • take the hadoop code, rename the packages, include any new dependencies in your redistributed JAR, and somehow solve the libhadoop path problem (the package-renaming part is sketched below).

Either path is doomed; the second one might be possible if it weren't for the native stuff.
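To make the second bullet concrete, the package-renaming part could be expressed with sbt-assembly shade rules, roughly like this (a sketch under assumed build tooling; it does nothing about the native libhadoop problem, which is exactly the part that kills the approach):

    // build.sbt sketch (assumes the sbt-assembly plugin): relocate the bundled
    // Hadoop classes so they don't clash with the platform's Hadoop 2.7 jars.
    // "shadedhadoop" is just a placeholder prefix. Even relocated, the codec
    // still needs a libhadoop built with zstd at runtime, so this alone is not
    // a fix.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("org.apache.hadoop.**" -> "shadedhadoop.@1").inAll
    )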

stevel