
If a dataset has a column containing large binary data (e.g. an image or sound-wave data), then computing min/max statistics for that column becomes costly in both compute and storage, despite being completely useless (querying such values by range obviously makes no sense).

This causes the metadata of large, highly partitioned Parquet datasets to explode in size. Is there a way to tell fastparquet not to compute statistics for some columns, or does the Parquet format mandate that these statistics exist for every column?

stav
  • Note: with the latest version of pyarrow (>= 0.14) this is possible by specifying the `write_statistics` keyword. See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html – joris Aug 08 '19 at 13:26
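
For completeness, a minimal sketch of the pyarrow approach mentioned in the comment above; the column names are made up, and list-of-columns support for `write_statistics` depends on your pyarrow version:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: a small numeric column plus a large binary payload column.
table = pa.table({
    "id": [1, 2, 3],
    "payload": [b"\x00" * 1024, b"\x01" * 1024, b"\x02" * 1024],
})

# write_statistics takes a bool (statistics for all or no columns) or,
# in newer pyarrow versions, a list of column names to compute them for.
pq.write_table(table, "data.parquet", write_statistics=["id"])

# Disable statistics entirely:
pq.write_table(table, "data_nostats.parquet", write_statistics=False)
```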

1 Answer


This is implemented in a stale PR, which may either be merged at some point (it breaks compatibility with Python 2) or have its relevant parts extracted. The PR adds a `stats=` argument to the writer, which can be used to pick which columns have their min/max computed, or `True`/`False` for all/none.
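
For illustration only, usage of that unmerged `stats=` argument might look roughly like the sketch below; the keyword is hypothetical in released fastparquet versions, and the column names are made up:

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({
    "id": [1, 2, 3],
    "image": [b"\x00" * 1024] * 3,   # large binary payload column
})

# Hypothetical: only valid if the PR's `stats=` keyword were merged.
# A list would pick specific columns; True/False would enable/disable
# statistics for all columns.
fastparquet.write("out.parq", df, stats=["id"])
```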

mdurant