The parameter `spark.io.compression.zstd.level` applies to the codec used to compress intermediate files: serialized RDDs, shuffle, broadcast, and checkpoints. In most cases the only thing that matters there is compression speed, so the default of 1 is the best choice (you also need to set `spark.io.compression.codec` to `zstd` for this parameter to have any effect).
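A minimal PySpark sketch of that configuration (the app name and explicit level are just examples; 1 is already the default):

```python
from pyspark.sql import SparkSession

# zstd for intermediate data (shuffle, broadcast, RDD/checkpoint blocks);
# the level setting only takes effect together with the codec setting.
spark = (
    SparkSession.builder
    .appName("zstd-io-compression")                   # hypothetical app name
    .config("spark.io.compression.codec", "zstd")
    .config("spark.io.compression.zstd.level", "1")   # default level, favours speed
    .getOrCreate()
)
```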
Sadly, there is no way in Spark to specify a compression level for the Parquet codec set via `spark.sql.parquet.compression.codec`.
Starting with Spark 3.2 (which bundles parquet-mr >= 1.12.0), there is a `parquet.compression.codec.zstd.level` option, but it doesn't seem to work:
```
In [5]: for i in [1, 5, 10]:
   ...:     df.write.option('parquet.compression.codec.zstd.level', i) \
   ...:         .parquet(f"test-{i}.parquet", compression='zstd', mode='overwrite')

In [6]: !du -sh test-*.parquet
40M     test-10.parquet
40M     test-1.parquet
40M     test-5.parquet
```
As a workaround, one could use the Parquet implementation from the arrow project (directly in C++, or via pyarrow / Go / etc.; it allows specifying a `compression_level` per column's codec, as well as a default `compression_level`) to repack the data before writing it into the warehouse.
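For example, a repacking step with pyarrow could look like the sketch below (file names and the chosen level are only illustrative; both `compression` and `compression_level` also accept per-column dicts):

```python
import pyarrow.parquet as pq

# Read the Parquet output produced by Spark and rewrite it with an
# explicit zstd level before moving it into the warehouse.
table = pq.read_table("test-1.parquet")
pq.write_table(
    table,
    "repacked.parquet",
    compression="zstd",
    compression_level=10,   # example level; pass a dict for per-column levels
)
```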
Sadly, the `arrow-rs` Parquet implementation doesn't allow specifying the `compression_level` either. But luckily, `parquet2`, which is used in `arrow2` (a transmute-free Rust implementation of Arrow), does.