I'm trying to read a range of rows (say rows 1000 to 5000) from a Parquet file. I've tried pandas with the fastparquet engine and even pyarrow, but I can't find any option to do so.
Is there any way to achieve this?
I don't think the current pyarrow version (2.0) supports it.
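One partial workaround: Parquet files are physically split into row groups, and pyarrow can read a single row group at a time through ParquetFile. If your desired range happens to line up with row-group boundaries (which depends on how the file was written), you can avoid reading the rest of the file. A minimal sketch, assuming filename points at your file:

import pyarrow.parquet as pq

pf = pq.ParquetFile(filename)          # opens the file without reading the data
print(pf.metadata.num_row_groups)      # how many row groups the file contains
first_rows = pf.read_row_group(0)      # read only the first row group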
Otherwise, the closest you can get to slicing a file by rows is the filters argument of read_table:
filters (List[Tuple] or List[List[Tuple]] or None (default)) – Rows which do not match the filter predicate will be removed from scanned data. Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates.
If your dataset has a column foo on which you can select the rows you need, use something like this:
import pyarrow.parquet as pq
table = pq.read_table(filename, filters=[('foo', '>', 0)])
If you happen to have a column id corresponding to the row index, you can use
table = pq.read_table(filename, filters=[('id', '>=', 1000), ('id', '<=', 5000)])
(a flat list of tuples is combined with AND, so both conditions must hold).
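If there is no such column, a fallback (assuming the file fits in memory) is to read the whole table and slice it afterwards; this doesn't skip any I/O, but it does give you an exact row range:

import pyarrow.parquet as pq

table = pq.read_table(filename)     # reads the entire file into memory
subset = table.slice(1000, 4001)    # slice(offset, length): rows 1000..5000
df = subset.to_pandas()             # convert to a pandas DataFrame if needed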