
Is there a way to use pyarrow parquet dataset to read specific columns and if possible filter data instead of reading a whole file into dataframe?

Asclepius
Punter Vicky
  • Yes to reading specific columns; that's one of the strengths of the Parquet format. With `pd.read_parquet()` you can specify the columns with the `columns` arg. To my knowledge you can't filter on load. – leroyJr Sep 10 '19 at 22:12
  • You can also filter a dataset when reading, but for now only for a partitioned dataset (consisting of multiple files in nested directories; see the `filters` argument in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html). Filtering within a single file is being worked on (see https://issues.apache.org/jira/browse/ARROW-1796). – joris Sep 11 '19 at 13:40
  • See also the answer to this question: https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset/56562285?noredirect=1#comment99829368_56562285 – joris Sep 11 '19 at 13:40

1 Answer


As of `pyarrow==2.0.0`, this is possible at least with `pyarrow.parquet.ParquetDataset`.

To read specific columns, its `read` and `read_pandas` methods have a `columns` option. You can also do this with `pandas.read_parquet`.

To read specific rows, its `__init__` method has a `filters` option.

Asclepius
  • Excellent! Hmm, I wonder if all rows are read into memory before filtering out the bad ones. If so, the memory pressure still spikes, though I bet the Arrow table is smaller than the pandas df. – Kermit May 14 '22 at 02:30