
I want to shard an Arrow Dataset. To do that, I'd like to use a monotonically increasing field and express the sharding as a filter that I can pass to a pyarrow Scanner: pc.field('id') % num_shards == shard_id

Any ideas on how to do this using PyArrow compute API?

qwertz1123
    Unfortunately, modulo is not yet available as a compute function. There is a [PR](https://github.com/apache/arrow/pull/11116) for it but it seems to have gone stale. You can probably work around this with bit manipulation functions. I'll add an answer. – Pace Jan 04 '23 at 18:51

2 Answers


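Although there is not yet a modulo function, there is a bit_wise_and function that achieves the same thing when the divisor is a power of two. For example, x % 8 equals x & 7, so masking with 7 and comparing to 0 keeps every eighth value: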

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc

arr = pa.array(range(100))
tab = pa.Table.from_arrays([arr], names=['x'])
my_filter = pc.bit_wise_and(pc.field('x'), 7) == 0
filtered = ds.dataset(tab).to_table(filter=my_filter)
print(filtered)
# pyarrow.Table
# x: int64
# ----
# x: [[0,8,16,24,32,...,64,72,80,88,96]]
Pace

Taking inspiration from Pace's answer, this seems to work for an arbitrary divisor. The divisibility check works for negative numbers too:

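import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc

divisor = 5
arr = pa.array(range(-100, 100))
tab = pa.Table.from_arrays([arr], names=['x'])
# x - (x / divisor) * divisor is the remainder of integer division; it is 0 exactly when divisor divides x
my_filter = pc.subtract(pc.field("x"), pc.multiply(pc.divide(pc.field("x"), divisor), divisor)) == 0
filtered = ds.dataset(tab).to_table(filter=my_filter)
print(filtered)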
# pyarrow.Table
# x: int64
# ----
# x: [[-100,-95,-90,-85,-80,...,75,80,85,90,95]]

Or, cleaned up a bit:

def pc_mod(field: str, divisor: int):
    # Remainder expression built from integer divide, multiply and subtract
    x = pc.field(field)
    return pc.subtract(x, pc.multiply(pc.divide(x, divisor), divisor))

print(ds.dataset(tab).to_table(filter=pc_mod("x", 5) == 0))
# pyarrow.Table
# x: int64
# ----
# x: [[-100,-95,-90,-85,-80,...,75,80,85,90,95]]
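
One caveat when comparing against a non-zero shard_id: if integer pc.divide truncates toward zero (as C++ integer division does, which is assumed here), the remainder x - (x / d) * d takes the sign of x, so negative values never match a positive shard id. The arithmetic can be checked in plain Python. The function names below are illustrative only, and trunc_div merely emulates the assumed truncating division:

```python
def trunc_div(x, d):
    """Integer division that truncates toward zero (assumed behavior of integer pc.divide)."""
    q = abs(x) // abs(d)
    return q if (x >= 0) == (d > 0) else -q


def trunc_remainder(x, d):
    """x - (x / d) * d with truncating division; the sign follows x."""
    return x - trunc_div(x, d) * d


def shard_of(x, num_shards):
    """Non-negative shard index: shift the remainder up by num_shards and reduce again."""
    r = trunc_remainder(x, num_shards)
    return trunc_remainder(r + num_shards, num_shards)


print(trunc_remainder(-7, 5))  # -2: never equal to a shard id in 0..4
print(shard_of(-7, 5))         # 3: same as Python's -7 % 5
```

The same shift-and-reduce step could be written as a filter expression with pc.add, pc.subtract, pc.multiply and pc.divide. If the id column is never negative, as with a monotonically increasing counter starting at 0, the plain remainder is already enough.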
santon