I am trying to filter pyarrow data with pyarrow.dataset expressions, and I want a dynamic way to build the combined filter from a variable number of expressions.

from pyarrow import parquet as pq
import pyarrow.dataset as ds
import datetime

exp1 = ds.field("IntCol") == 1
exp2 = ds.field("StrCol") == 'A'
exp3 = ds.field("DateCol") == datetime.date.today()

filters = (exp1 & exp2 & exp3)
print(filters)

#To be used in reading parquet tables
df = pq.read_table('sample.parquet', filters=filters)

How can I do this without writing "&" explicitly, since I may have N expressions? I have been looking at different ways to combine expressions, such as np.logical_and.accumulate(). It gets me partway there, but I still need to convert the resulting array into a single expression.

import numpy as np

np.logical_and.accumulate([exp1, exp2, exp3])

out: array([<pyarrow.dataset.Expression (IntCol == 1)>,
       <pyarrow.dataset.Expression (StrCol == "A")>,
       <pyarrow.dataset.Expression (DateCol == 2021-06-09)>], dtype=object)

Going down the numpy route may not be the best approach. Does anyone have a suggestion for how this can be done?

Ted. Z

1 Answer

You can use operator.and_ as the functional equivalent of the & operator. With functools.reduce, it can then be applied cumulatively, from left to right, across a list of expressions.

Using your three example expressions:

import operator
import functools

>>> functools.reduce(operator.and_, [exp1, exp2, exp3])
<pyarrow.dataset.Expression (((IntCol == 1) and (StrCol == "A")) and (DateCol == 2021-06-10))>
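
If the list is built at runtime and may turn out to be empty, the reduce call above could be wrapped in a small helper, since functools.reduce raises a TypeError on an empty sequence when no initial value is given. This is only a minimal sketch: combine_filters is a hypothetical name, not a pyarrow function, and returning None for an empty list assumes the caller treats None as "no filter" (None is the default value of filters in pq.read_table).

```python
import functools
import operator


def combine_filters(expressions):
    """AND together any number of filter expressions.

    Hypothetical helper, not part of pyarrow. It works on any objects
    that support the & operator. Returns None for an empty input, on the
    assumption that the caller treats None as "no filter".
    """
    expressions = list(expressions)
    if not expressions:
        return None
    # reduce evaluates ((e1 & e2) & e3) & ... from left to right
    return functools.reduce(operator.and_, expressions)
```

With the expressions from the question, combine_filters([exp1, exp2, exp3]) builds the same expression as exp1 & exp2 & exp3, and the result can be passed as filters= to pq.read_table.
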
joris