
If a dataset has a column containing large binary data (e.g. an image or sound-wave data), then computing min/max statistics for that column becomes costly in both compute and storage, despite being completely useless (querying such values by range obviously makes no sense).

This causes the metadata of large, highly partitioned Parquet datasets to explode in size. Is there a way to tell fastparquet not to compute statistics for some columns, or does the Parquet format mandate that these statistics exist for every column?

stav
  • Note: with the latest version of pyarrow (>= 0.14) this is possible by specifying the `write_statistics` keyword. See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html – joris Aug 08 '19 at 13:26
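
For completeness, a minimal sketch of the pyarrow approach mentioned in the comment above; the column names are made up, and list-of-columns support for `write_statistics` depends on your pyarrow version:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: a small numeric column plus a large binary payload column.
table = pa.table({
    "id": [1, 2, 3],
    "payload": [b"\x00" * 1024, b"\x01" * 1024, b"\x02" * 1024],
})

# write_statistics takes a bool (statistics for all or no columns) or,
# in newer pyarrow versions, a list of column names to compute them for.
pq.write_table(table, "data.parquet", write_statistics=["id"])

# Disable statistics entirely:
pq.write_table(table, "data_nostats.parquet", write_statistics=False)
```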

1 Answer


This is implemented in a stale PR, which may either be merged at some point (it breaks compatibility with Python 2) or have its relevant parts extracted. The PR adds a `stats=` argument to the writer, which can be used to pick which columns have their min/max computed, or `True`/`False` for all/none.
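
For illustration only, usage of that unmerged `stats=` argument might look roughly like the sketch below; the keyword is hypothetical in released fastparquet versions, and the column names are made up:

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({
    "id": [1, 2, 3],
    "image": [b"\x00" * 1024] * 3,   # large binary payload column
})

# Hypothetical: only valid if the PR's `stats=` keyword were merged.
# A list would pick specific columns; True/False would enable/disable
# statistics for all columns.
fastparquet.write("out.parq", df, stats=["id"])
```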

mdurant