
I have a pipeline that parses records from Avro files.

I need to split the incoming records into chunks of 500 items in order to call an API that takes multiple inputs at the same time.

Is there a way to do this with the Python SDK?

I Vazquez

1 Answer


I'm assuming you mean a batch use case. You have a couple of options for this:

If your PCollection is large enough, and you have some flexibility on the size of your bundles, you can use a GroupByKey transform after assigning keys to your elements in random/round-robin order. For example:

import apache_beam as beam
from random import randint

my_collection = p | ReadRecordsFromAvro()  # e.g. beam.io.ReadFromAvro(...)

element_bundles = (my_collection
                     # Choose a number of keys that works for you (I chose 50 here)
                   | 'AddKeys' >> beam.Map(lambda x: (randint(0, 50), x))
                   | 'MakeBundles' >> beam.GroupByKey()
                     # Tuple-unpacking lambdas are Python 2 only, so index
                     # into the (key, bundle) pair instead
                   | 'DropKeys' >> beam.Map(lambda kv: kv[1])
                   | beam.ParDo(ProcessBundlesDoFn()))

Where ProcessBundlesDoFn looks something like this:

class ProcessBundlesDoFn(beam.DoFn):
  def process(self, bundle):
    # After GroupByKey, the grouped values arrive as an iterable;
    # emit them in batches of up to 500 elements.
    batch = []
    for element in bundle:
      batch.append(element)
      if len(batch) == 500:
        yield batch
        batch = []
    if batch:  # Emit the final, possibly smaller, batch
      yield batch
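As a side note, the batching itself can be pulled out into a plain generator that you can unit-test outside of Beam (a sketch; the `chunk` name is just illustrative):

```python
def chunk(iterable, size=500):
    """Yield successive lists of up to `size` elements from `iterable`."""
    batch = []
    for element in iterable:
        batch.append(element)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # the last batch may be smaller than `size`
        yield batch
```

For 1200 input elements this yields batches of 500, 500, and 200.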

If you need to have all bundles of exactly 500 elements, then you may need to:

  1. Count the number of elements in your PCollection.
  2. Pass that count as a singleton side input to your 'AddKeys' ParDo, so it can determine exactly the number of keys that you will need.
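As a rough sketch of the arithmetic in step 2 (the helper name here is hypothetical): with round-robin rather than random key assignment, the key count that makes every bundle except possibly the last hold exactly 500 elements is:

```python
import math

def num_keys(total_count, batch_size=500):
    # One key per batch of `batch_size`; at least one key even for
    # an empty collection so the pipeline still wires up.
    return max(1, math.ceil(total_count / batch_size))
```

You would compute `total_count` with beam.combiners.Count.Globally() and feed it in via beam.pvalue.AsSingleton.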

Hope that helps.

Pablo
    Thanks Pablo, that's what I ended up doing. Unfortunately, I want to parallelise this as much as possible and the number of "buckets" or random keys I could distribute the load in is difficult to calculate up front. The documentation for passing side inputs as singletons is very sparse. Thanks! – I Vazquez Jul 31 '17 at 23:17
  • Also, as a note, the number of keys that you choose will determine the parallelism in your pipeline, since each key is processed serially. For instance, with 50 keys your job will be unable to run in more than 50 machines (in practice, the final number of workers is much smaller than the number of keys). – Pablo Jul 31 '17 at 23:23
  • Also, I'll detail the side input approach if you need me to. – Pablo Aug 06 '17 at 21:17
  • Thanks Pablo! I've managed to do it by pre-calculating the number of items at DAG build time. – I Vazquez Aug 07 '17 at 23:05