7

I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow

The answer in the above link is in Java, whereas the language I'm working with is Python. Thus, I require some help getting a similar construction.

Specifically I have this:

 p = beam.Pipeline(options=pipeline_options)
 lines = p | 'File reading' >> ReadFromText(known_args.input)

After this, I need to create another PCollection, but with lists of N rows of "lines", since my use case requires working on groups of rows. I cannot operate line by line.

I tried a ParDo that counts rows in an instance variable and groups every N rows with a Map afterwards, but the counters are reset every 1000 records, so that isn't the solution I'm looking for. I read the example in the link, but I don't know how to do something similar in Python.

I tried saving the counters in Datastore; however, the speed difference between Dataflow's reads and Datastore's writes is quite significant.

What is the correct way to do this? I don't know how else to approach it. Regards.

SU3
  • Could you rephrase what do you want to achieve? Do you want to be able to have whole input file (all lines) as single list? – Marcin Zablocki Mar 26 '18 at 19:49
  • Hi @MarcinZablocki, no, I want a PCollection of List with N rows from the input file, for example: if N is 2 and the input is "1,2,3,4,5,6,7,8" where the comma is a jump of line, I want a PCollection that is something like this: PCollection[List(1,2), List(3,4), List(5,6), List(7,8)] – Luis Felipe Muñoz Mar 26 '18 at 21:07
  • 1
    So what if the input is "1,2,3,4,5,6,7" and N=2? What should the output PCollection look like? – Arjun Kay Mar 27 '18 at 04:57
  • PCollection is **unordered**. Unless your input contains the order information (say `ReadFromText` returns tuples of `(sequence number, element)`), this kind of deterministic grouping is tricky to do with beam (need `State` or data-driven triggers). If your pipeline doesn't require deterministic grouping, you can maintain a buffer of size N in your DoFn and flush the buffer every time when it's full (or in `finish_bundle`). – Jiayuan Ma Mar 27 '18 at 19:42
  • 1
    [This](https://stackoverflow.com/questions/48267159/beam-dataflow-2-2-0-extract-first-n-elements-from-pcollection) question seems to be similar to yours - the answer was to use the [Top transform](https://beam.apache.org/documentation/sdks/javadoc/2.2.0/org/apache/beam/sdk/transforms/Top.html). – Lefteris S Apr 03 '18 at 12:02
  • The order isn't important; what matters is that the output contains tuples of N records. The question is how to transform a PCollection into, for example, a PCollection of lists of N rows. – Luis Felipe Muñoz Apr 05 '18 at 16:43

1 Answer

10

Assuming the grouping order is not important, you can simply group inside a DoFn.

class Group(beam.DoFn):
  """Buffers elements and emits them as lists of size n."""

  def __init__(self, n):
    self._n = n
    self._buffer = []

  def process(self, element):
    self._buffer.append(element)
    if len(self._buffer) == self._n:
      # Emit a full batch and start a new buffer.
      yield list(self._buffer)
      self._buffer = []

  def finish_bundle(self):
    # Flush any leftover elements when the bundle ends.
    # Note: recent Beam SDKs require finish_bundle to yield
    # WindowedValue objects rather than plain elements.
    if self._buffer:
      yield list(self._buffer)
      self._buffer = []

lines = (p | 'File reading' >> ReadFromText(known_args.input)
           | 'Group' >> beam.ParDo(Group(known_args.N))
           ...)
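To see how the buffering behaves, here is a minimal plain-Python sketch of the same logic outside Beam (the class name is illustrative only): with N=2 and seven inputs, batches of two are emitted as the buffer fills, and the single leftover element is flushed at the end, mirroring what `finish_bundle` does for a bundle.

```python
class GroupSketch:
    """Plain-Python stand-in for the Group DoFn's buffering logic."""

    def __init__(self, n):
        self._n = n
        self._buffer = []

    def process(self, element):
        # Same logic as DoFn.process: emit a batch once the buffer is full.
        self._buffer.append(element)
        if len(self._buffer) == self._n:
            yield list(self._buffer)
            self._buffer = []

    def finish_bundle(self):
        # Same logic as DoFn.finish_bundle: flush any partial batch.
        if self._buffer:
            yield list(self._buffer)
            self._buffer = []

g = GroupSketch(2)
batches = []
for x in [1, 2, 3, 4, 5, 6, 7]:
    batches.extend(g.process(x))
batches.extend(g.finish_bundle())
# batches is now [[1, 2], [3, 4], [5, 6], [7]]
```

Note that batch contents depend on bundle boundaries, so they are not deterministic across runs, as discussed in the comments. Newer Python SDKs also ship a built-in `beam.BatchElements` transform that performs similar batching.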
Jiayuan Ma