
Is there a way to use pyarrow parquet dataset to read specific columns and if possible filter data instead of reading a whole file into dataframe?

Asclepius
Punter Vicky
  • Yes to reading specific columns; that's one of the strengths of the Parquet format. With `pd.read_parquet()` you can specify the columns with the `columns` arg. To my knowledge you can't filter on load. – leroyJr Sep 10 '19 at 22:12
  • You can also filter a dataset when reading, but for now only for a partitioned dataset (consisting of multiple files in nested directories; see the `filters` argument in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html). Filtering within a single file is being worked on (see https://issues.apache.org/jira/browse/ARROW-1796). – joris Sep 11 '19 at 13:40
  • See also the answer to this question: https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset/56562285?noredirect=1#comment99829368_56562285 – joris Sep 11 '19 at 13:40

1 Answer


As of `pyarrow==2.0.0`, this is possible at least with `pyarrow.parquet.ParquetDataset`.

To read specific columns, its `read` and `read_pandas` methods have a `columns` option. You can also do this with `pandas.read_parquet`.

To read specific rows, its `__init__` method has a `filters` option.

Asclepius
  • Excellent! Hmm, I wonder if all rows are read into memory before filtering out the bad ones. If so, the memory pressure still spikes, though I bet the Arrow table is smaller than the pandas df. – Kermit May 14 '22 at 02:30