7

I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow

The answer in the above link is in Java, whereas the language I'm working with is Python. Thus, I require some help getting a similar construction.

Specifically I have this:

 p = beam.Pipeline(options=pipeline_options)
 lines = p | 'File reading' >> ReadFromText(known_args.input)

After this, I need to create another PCollection, but with lists of N rows of "lines", since my use case requires working on groups of rows. I cannot operate line by line.

I tried a ParDo that counts rows in an instance variable and groups every N rows with a Map afterwards, but the counters are reset every 1000 records, so that isn't the solution I'm looking for. I read the example in the link, but I don't know how to do something similar in Python.

I tried saving the counters in Datastore; however, the speed difference between Dataflow's reads and Datastore's writes is quite significant.

What is the correct way to do this? I don't know how else to approach it. Regards.

SU3
  • Could you rephrase what do you want to achieve? Do you want to be able to have whole input file (all lines) as single list? – Marcin Zablocki Mar 26 '18 at 19:49
  • Hi @MarcinZablocki, no, I want a PCollection of List with N rows from the input file, for example: if N is 2 and the input is "1,2,3,4,5,6,7,8" where the comma is a jump of line, I want a PCollection that is something like this: PCollection[List(1,2), List(3,4), List(5,6), List(7,8)] – Luis Felipe Muñoz Mar 26 '18 at 21:07
  • 1
    So what if the input is "1,2,3,4,5,6,7" and N=2? What should the output PCollection look like? – Arjun Kay Mar 27 '18 at 04:57
  • PCollection is **unordered**. Unless your input contains the order information (say `ReadFromText` returns tuples of `(sequence number, element)`), this kind of deterministic grouping is tricky to do with beam (need `State` or data-driven triggers). If your pipeline doesn't require deterministic grouping, you can maintain a buffer of size N in your DoFn and flush the buffer every time when it's full (or in `finish_bundle`). – Jiayuan Ma Mar 27 '18 at 19:42
  • 1
    [This](https://stackoverflow.com/questions/48267159/beam-dataflow-2-2-0-extract-first-n-elements-from-pcollection) question seems to be similar to yours - the answer was to use the [Top transform](https://beam.apache.org/documentation/sdks/javadoc/2.2.0/org/apache/beam/sdk/transforms/Top.html). – Lefteris S Apr 03 '18 at 12:02
  • The order isn't important; what matters is that the output contains tuples of N records. The question is how to transform a PCollection into, for example, a PCollection of lists of N rows. – Luis Felipe Muñoz Apr 05 '18 at 16:43

1 Answer

10

Assuming the grouping order is not important, you can simply group inside a DoFn.

class Group(beam.DoFn):
  """Buffers elements and emits them as lists of size n."""

  def __init__(self, n):
    self._n = n
    self._buffer = []

  def process(self, element):
    self._buffer.append(element)
    if len(self._buffer) == self._n:
      # Emit a full batch and start a new buffer.
      yield list(self._buffer)
      self._buffer = []

  def finish_bundle(self):
    # Flush any leftover elements when the bundle ends.
    # Note: recent Beam SDKs require finish_bundle to yield
    # WindowedValue objects rather than plain elements.
    if self._buffer:
      yield list(self._buffer)
      self._buffer = []

lines = (p | 'File reading' >> ReadFromText(known_args.input)
           | 'Group' >> beam.ParDo(Group(known_args.N))
           ...)
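To see how the buffering behaves, here is a minimal plain-Python sketch of the same logic outside Beam (the class name is illustrative only): with N=2 and seven inputs, batches of two are emitted as the buffer fills, and the single leftover element is flushed at the end, mirroring what `finish_bundle` does for a bundle.

```python
class GroupSketch:
    """Plain-Python stand-in for the Group DoFn's buffering logic."""

    def __init__(self, n):
        self._n = n
        self._buffer = []

    def process(self, element):
        # Same logic as DoFn.process: emit a batch once the buffer is full.
        self._buffer.append(element)
        if len(self._buffer) == self._n:
            yield list(self._buffer)
            self._buffer = []

    def finish_bundle(self):
        # Same logic as DoFn.finish_bundle: flush any partial batch.
        if self._buffer:
            yield list(self._buffer)
            self._buffer = []

g = GroupSketch(2)
batches = []
for x in [1, 2, 3, 4, 5, 6, 7]:
    batches.extend(g.process(x))
batches.extend(g.finish_bundle())
# batches is now [[1, 2], [3, 4], [5, 6], [7]]
```

Note that batch contents depend on bundle boundaries, so they are not deterministic across runs, as discussed in the comments. Newer Python SDKs also ship a built-in `beam.BatchElements` transform that performs similar batching.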
Jiayuan Ma