
After using TextIO.read to get a PCollection<String> of the individual lines, is it possible to then use some kind of combine transform to group them into batches (groups of 25, for example)? So the return type would end up looking something like PCollection<List<String>>. It looks like it should be possible using some kind of CombineFn, but the API is still a little arcane to me.

The context here is that I'm reading CSV files (potentially very large), parsing and processing the lines, turning them into JSON, and then calling a REST API... but I don't want to hit the REST API for each line individually, because it supports multiple items at a time (up to 1000, so not the whole batch).

chinabuffet
  • Have you tried "Data-driven triggers" with a large timing window? https://beam.apache.org/documentation/programming-guide/#data-driven-triggers . If I understand correctly, you expect to emit the entire batch once the aggregated element count meets a certain number. – greeness May 29 '18 at 19:49
  • I haven't tried that yet, but after reading it, I'm not 100% sure how to apply it. I'm going to update the original question's example context/scenario with a little more detail on what I'm doing. – chinabuffet May 29 '18 at 20:13

1 Answer


I guess you can do some simple batching like below, using a DoFn that buffers lines within each bundle (via start_bundle/finish_bundle, rather than Beam's full stateful API). The state to maintain in BatchingFn is the current buffer of lines, self._lines. Sorry, I did it in Python (I'm not familiar with the Java API).

from apache_beam.transforms import DoFn
from apache_beam.transforms import ParDo

MY_BATCH_SIZE = 512

class BatchingFn(DoFn):
  def __init__(self, batch_size=100):
    self._batch_size = batch_size

  def start_bundle(self):
    # buffer for the lines seen in the current bundle
    self._lines = []

  def process(self, element):
    # Input element is a string (representing a CSV line)
    self._lines.append(element)
    if len(self._lines) >= self._batch_size:
      self._flush_batch()

  def finish_bundle(self):
    # takes care of the unflushed buffer before finishing
    if self._lines:
      self._flush_batch()

  def _flush_batch(self):
    #### Do your REST API call here with self._lines
    # .....
    # Clear the buffer.
    self._lines = []

# pcoll is your PCollection of lines.
(pcoll | 'Call REST API with batch data' >> ParDo(BatchingFn(MY_BATCH_SIZE)))
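
For completeness, here's a minimal sketch of how this could slot into the full pipeline from the question (the file path and the naive CSV-to-JSON conversion are placeholder assumptions; BatchingFn is the class above):

import json
import apache_beam as beam

with beam.Pipeline() as p:
  (p
   # Hypothetical input path; substitute your own CSV location.
   | 'Read CSV lines' >> beam.io.ReadFromText('gs://my-bucket/input.csv')
   # Placeholder parsing: split each line and serialize it as JSON.
   | 'To JSON' >> beam.Map(lambda line: json.dumps(line.split(',')))
   | 'Call REST API with batch data' >> beam.ParDo(BatchingFn(MY_BATCH_SIZE)))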

Regarding data-driven triggers, you can refer to Batch PCollection in Beam/Dataflow.
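
Also worth noting: the Python SDK ships a built-in BatchElements transform (in apache_beam.transforms.util) that does this kind of buffering for you. A minimal sketch, assuming pcoll is your PCollection of lines and your SDK version includes the transform:

import apache_beam as beam

# Each output element is a list of input lines; max_batch_size caps a
# batch at the REST API's 1000-item limit.
batched = (pcoll
           | 'Batch lines' >> beam.BatchElements(min_batch_size=25,
                                                 max_batch_size=1000))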

greeness