I'm trying to read a range of rows (say rows 1000 to 5000) from a Parquet file. I've tried pandas with the fastparquet engine and even pyarrow, but I can't find any option to do so.
Is there any way to achieve this?
I don't think the current pyarrow version (2.0) supports it.
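One partial workaround: Parquet files are physically split into row groups, and pyarrow can read a single row group at a time through ParquetFile. If your desired range happens to line up with row-group boundaries (which depends on how the file was written), you can avoid reading the rest of the file. A minimal sketch, assuming filename points at your file:

import pyarrow.parquet as pq

pf = pq.ParquetFile(filename)          # opens the file without reading the data
print(pf.metadata.num_row_groups)      # how many row groups the file contains
first_rows = pf.read_row_group(0)      # read only the first row group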
Otherwise, the closest you can get to slicing a file by rows is the filters argument of read_table:
filters (List[Tuple] or List[List[Tuple]] or None (default)) – Rows which do not match the filter predicate will be removed from scanned data. Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates.
If your dataset has a column foo on which you can select the rows you need, use something like this:
import pyarrow.parquet as pq
table = pq.read_table(filename, filters=[('foo', '>', 0)])
If you happen to have a column id corresponding to the row index, you can use
table = pq.read_table(filename, filters=[('id', '>=', 1000), ('id', '<=', 5000)])
(a flat list of tuples is combined with AND, so both conditions must hold).
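If there is no such column, a fallback (assuming the file fits in memory) is to read the whole table and slice it afterwards; this doesn't skip any I/O, but it does give you an exact row range:

import pyarrow.parquet as pq

table = pq.read_table(filename)     # reads the entire file into memory
subset = table.slice(1000, 4001)    # slice(offset, length): rows 1000..5000
df = subset.to_pandas()             # convert to a pandas DataFrame if needed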