I have got a pipeline that parses records from AVRO files.
I need to split the incoming records into chunks of 500 items in order to call an API that takes multiple inputs at the same time.
Is there a way to do this with the Python SDK?
I'm assuming you mean a batch (bounded) use case. You have a couple of options here:
If your PCollection is large enough, and you have some flexibility in the size of your bundles, you can use a GroupByKey
transform after assigning keys to your elements in random / round-robin order, e.g.:
from random import randint
import apache_beam as beam

my_collection = p | ReadRecordsFromAvro()
element_bundles = (
    my_collection
    # Choose a number of keys that works for you (I chose 50 here)
    | 'AddKeys' >> beam.Map(lambda x: (randint(0, 50), x))
    | 'MakeBundles' >> beam.GroupByKey()
    # Python 3 lambdas can't unpack tuples, so index into the (key, bundle) pair
    | 'DropKeys' >> beam.Map(lambda kv: kv[1])
    | 'ProcessBundles' >> beam.ParDo(ProcessBundlesDoFn()))
Where ProcessBundlesDoFn is something like this:
class ProcessBundlesDoFn(beam.DoFn):
    def process(self, bundle):
        # 'bundle' is the iterable of values grouped under one key;
        # emit it in chunks of up to 500 elements until it is exhausted.
        chunk = []
        for element in bundle:
            chunk.append(element)
            if len(chunk) == 500:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
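Each element coming out of ProcessBundlesDoFn is then a list of up to 500 records, which you can hand to your API in a single downstream step. A rough sketch of that step, where call_batch_api is a hypothetical stand-in for your real client call:

import apache_beam as beam

def call_batch_api(chunk):
    # Hypothetical stand-in: replace with your actual client call that
    # accepts up to 500 inputs at once and returns their results.
    return [{'input': item, 'status': 'ok'} for item in chunk]

results = (element_bundles
           | 'CallApi' >> beam.Map(call_batch_api))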
If you need all of your bundles to have exactly 500 elements, then you may need to:
1. Count the number of elements in your PCollection, and
2. Pass that count into your 'AddKeys' ParDo, so that you can determine exactly the number of keys that you will need.
Hope that helps.