Control the compression level when writing Parquet files using Polars in Rust

Question

I found that, by default, Parquet files written by polars are around 35% larger than Parquet files written by Spark (on the same data). Spark uses snappy compression by default, and switching ParquetCompression to snappy in polars does not help. Is this because polars uses a more conservative compression level? Is there any way to control the compression level of Parquet files in polars? I checked the polars docs, and it seems that only Zstd accepts a ZstdLevel (I am not even sure whether that is a compression level).

Below is my code to write a DataFrame to a Parquet file using snappy compression.

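use std::{fs::File, io::BufWriter};
use polars::prelude::*;

let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
// finish returns a Result, so handle it instead of silently dropping it.
pw.finish(&mut df).expect("Unable to write the file j.parquet!");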

For comparison, Python pandas (which uses pyarrow) writes Parquet files that are similar in size to the ones written by Spark. – Benjamin Du, May 22 '22 at 20:39

ritchie46 · Accepted Answer · 2022-05-23T17:01:41.607


This is not (yet) possible in Rust polars. It will likely be supported in the next release of arrow2, and then we can implement it in polars as well.

If you want that functionality in Python polars, you can use pyarrow for this purpose. polars has zero-copy interop with pyarrow.

edited May 23 '22 at 17:01

answered May 23 '22 at 13:41
