I'm looking into grouping the elements of a flow into batches of a fixed size.

In pseudocode:

PCollection[String].apply(Grouped.size(10))

Basically, this converts a PCollection[String] into a PCollection[List[String]] in which each list contains 10 elements. Since this is a batch pipeline, if the element count does not divide evenly, the last batch would contain the leftover elements.
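To make the intended semantics concrete, here is a minimal sketch in plain Python. It is not Dataflow code: `batch_elements` is a hypothetical helper that works on an ordinary iterable and only illustrates the fixed-size grouping with a shorter final batch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_elements(elements: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` elements; the last list holds the remainder.

    Hypothetical helper for illustration only, not part of any SDK.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for element in elements:
        batch.append(element)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover elements that did not fill a whole batch
        yield batch


# 25 strings grouped with a batch size of 10
batches = list(batch_elements([str(i) for i in range(25)], 10))
```

With 25 input strings and a batch size of 10, this produces two full batches and one batch holding the remaining 5 elements.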

I have two ugly ideas: windows with fake timestamps, or a GroupBy using keys based on a random index to distribute evenly. Both seem too complex a solution for such a simple problem.

Elmar Weber

1 Answer

This question is similar to a variety of questions on how to batch elements. Take a look at these to get you started:

Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?

Partition data coming from CSV so I can process larger patches rather then individual lines
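Judging from the comment below, the approach in those answers appears to buffer elements within a bundle, relying on the process method not being called in parallel for a given bundle. The following is a rough sketch of that idea in plain Python, not the Dataflow API: `BundleBatcher`, `run_bundle`, and the method names are hypothetical and only loosely modeled on per-bundle lifecycle hooks.

```python
from typing import Generic, Iterable, List, TypeVar

T = TypeVar("T")


class BundleBatcher(Generic[T]):
    """Buffers elements and emits them in lists of `size` (hypothetical class)."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.buffer: List[T] = []

    def start_bundle(self) -> None:
        # Start every bundle with an empty buffer.
        self.buffer = []

    def process(self, element: T) -> List[List[T]]:
        # Called one element at a time, so no locking is needed here.
        self.buffer.append(element)
        if len(self.buffer) < self.size:
            return []
        full, self.buffer = self.buffer, []
        return [full]

    def finish_bundle(self) -> List[List[T]]:
        # Flush whatever is left when the bundle ends.
        if not self.buffer:
            return []
        rest, self.buffer = self.buffer, []
        return [rest]


def run_bundle(batcher: BundleBatcher[T], bundle: Iterable[T]) -> List[List[T]]:
    """Simulate a runner feeding one bundle through the batcher."""
    output: List[List[T]] = []
    batcher.start_bundle()
    for element in bundle:
        output.extend(batcher.process(element))
    output.extend(batcher.finish_bundle())
    return output


result = run_bundle(BundleBatcher(3), list("abcdefg"))
```

In this sketch each bundle flushes its own remainder, so a run split into several bundles could end up with more than one batch shorter than the requested size.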

Ben Chambers
  • Thanks, that's the easiest approach. I didn't know that a bundle has the execution semantics of the process method not being called in parallel. – Elmar Weber Mar 20 '16 at 20:54