Setting up Bloom filter with PyArrow

Asked Nov 14 '22 at 16:51

Active Nov 14 '22 at 16:51

Viewed 247 times

I'm writing some datasets to Parquet using pyarrow.parquet.write_to_dataset(). I'd like to enable the Bloom filter (stored in the file metadata) when writing, but I can't find a way to do this. I know that in Spark you can do something like

spark.sql("set parquet.filter.bloom.enabled=true")
spark.sql("set parquet.filter.columnindex.enabled=false")
spark.sql("set parquet.filter.stats.enabled=false")

as done in this thread.

Is there a way to do this with PyArrow or some other library?

Currently I am writing the dataset with

import pyarrow.parquet as pq

pq.write_to_dataset(table=table,
                    root_path=output_file,
                    filesystem=fsys,
                    schema=schema
                    )

asked Nov 14 '22 at 16:51

sancholp

1

Spark uses parquet-mr (the Java implementation of Parquet). PyArrow uses parquet-c++. I don't think the C++ implementation of Parquet has Bloom filter support: https://issues.apache.org/jira/browse/PARQUET-1327 – Pace Nov 15 '22 at 15:03


0 Answers