The parameter `spark.io.compression.zstd.level` applies to the codec used to compress intermediate files: serialized RDDs, shuffle, broadcast, and checkpoints. In most cases the only thing that matters there is compression speed, so the default of 1 is the best choice (you also need to set `spark.io.compression.codec` to `zstd` for this parameter to have any effect).
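A minimal PySpark sketch of that configuration (the app name and explicit level are just examples; 1 is already the default):

```python
from pyspark.sql import SparkSession

# zstd for intermediate data (shuffle, broadcast, RDD/checkpoint blocks);
# the level setting only takes effect together with the codec setting.
spark = (
    SparkSession.builder
    .appName("zstd-io-compression")                   # hypothetical app name
    .config("spark.io.compression.codec", "zstd")
    .config("spark.io.compression.zstd.level", "1")   # default level, favours speed
    .getOrCreate()
)
```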
Sadly, there is no way in Spark to specify a compression level for the Parquet codec set via `spark.sql.parquet.compression.codec`.
Starting with Spark 3.2 (which bundles parquet-mr >= 1.12.0), there is a `parquet.compression.codec.zstd.level` option, but it doesn't seem to work:
```
In [5]: for i in [1, 5, 10]:
   ...:     df.write.option('parquet.compression.codec.zstd.level', i) \
   ...:         .parquet(f"test-{i}.parquet", compression='zstd', mode='overwrite')

In [6]: !du -sh test-*.parquet
40M     test-10.parquet
40M     test-1.parquet
40M     test-5.parquet
```
As a workaround, one could use the Parquet implementation from the arrow project (directly in C++, or via pyarrow / Go / etc.; it allows specifying a `compression_level` per column's codec, as well as a default `compression_level`) to repack the data before writing it into the warehouse.
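For example, a repacking step with pyarrow could look like the sketch below (file names and the chosen level are only illustrative; both `compression` and `compression_level` also accept per-column dicts):

```python
import pyarrow.parquet as pq

# Read the Parquet output produced by Spark and rewrite it with an
# explicit zstd level before moving it into the warehouse.
table = pq.read_table("test-1.parquet")
pq.write_table(
    table,
    "repacked.parquet",
    compression="zstd",
    compression_level=10,   # example level; pass a dict for per-column levels
)
```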
Sadly, the `arrow-rs` Parquet implementation doesn't allow specifying the `compression_level` either. But luckily, `parquet2`, which is used in `arrow2` (a transmute-free Rust implementation of Arrow), does.