I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing it one by one, I need to combine "by N". Something like grouped(N)
. So, in case of bounded processing, it will group by 10 items in batch and last batch with whatever left.
Is this possible in Apache Beam?
Asked
Active
Viewed 1,450 times
1

yurybubnov
- 357
- 2
- 11
-
Isn't `GroupIntoBatches` PTransform applicable here ? – Raheel Jun 16 '19 at 11:41
1 Answers
3
Edit, looks like: Google Dataflow "elementCountExact" aggregation
You should be able to do something similar by assigning elements to global window and using AfterPane.elementCountAtLeast(N)
. You still need to account for what what if there isn’t enough elements to fire the trigger. You could use this:
Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(N),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(X))))
But you should ask yourself why do you need this heuristic in the first place, there probably is more idomatice way to solve your problem. Read about Data-Driven Triggers
in Beam’s programming guide

rav
- 3,579
- 1
- 18
- 18