
We have bounded data, around 3.5 million records in BigQuery. This data needs to be processed using Dataflow (mostly external API calls plus transformations).

From the document - https://cloud.google.com/dataflow/docs/resources/faq#beam-java-sdk

I see that batch mode uses a single thread and streaming uses 300 threads per worker. For us, most of the work is network-bound because of the external API calls.

  1. Considering this, which one would be more performant and cost-efficient? Batch, by spinning up x workers, or streaming with x workers and 300 threads?

  2. If it is streaming, should I send the data that is present in BigQuery to Pub/Sub? Is my understanding correct?

Sunil

2 Answers


The batch vs. streaming decision usually comes from the source that you are reading from (bounded vs. unbounded). When reading from BigQueryIO, the input is bounded.

There are ways to convert from a BoundedSource to an UnboundedSource (see Using custom DataFlow unbounded source on DirectPipelineRunner), but I don't see it recommended anywhere, and I am not sure you would get any benefit from it. Streaming has to keep track of checkpoints and watermarks, which could result in overhead for your workers.

Bruno Volpato
  • OK, but do I need to consider the fact that streaming runs 300 threads per worker? Won't streaming be more performant here? – Sunil Aug 24 '22 at 20:02
  • Fair, it seems you would have to parallelize further to suit your use case. Maybe [GroupIntoBatches](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/GroupIntoBatches.html) could be used and `.parallelStream()` inside the ParDo, or maybe create your own parallelization (e.g., ExecutorService) and control the lifecycle using [@StartBundle](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/DoFn.StartBundle.html) / [@FinishBundle](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/DoFn.FinishBundle.html) – Bruno Volpato Aug 24 '22 at 21:51
  • Formatting is weird in comments, my idea is doing something like this: https://gist.github.com/bvolpato/3de9fa33293fb3b3ad19fbd169c365e3 – Bruno Volpato Aug 24 '22 at 21:52
  • On Dataflow, you can force the pipeline to run in streaming mode by passing the `--streaming` option. This may or may not be faster. Using a threadpool in a single ParDo as suggested is another option (though note that elements must only be emitted on the calling thread). – robertwb Aug 25 '22 at 23:57
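The thread-pool-per-bundle idea from the comments can be sketched in plain Python without Beam. This is only an illustration of the lifecycle pattern (create the pool when a bundle starts, drain it when the bundle finishes, emit on the calling thread); `call_api` and `BatchedApiCaller` are hypothetical names, not Beam or Dataflow APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def call_api(record):
    # Hypothetical stand-in for the slow, network-bound external API call.
    return record + 1

class BatchedApiCaller:
    # Mirrors the Beam DoFn lifecycle: the pool is created in
    # start_bundle (cf. @StartBundle) and drained in finish_bundle
    # (cf. @FinishBundle), so results are emitted on the calling thread.
    def __init__(self, num_threads=10):
        self.num_threads = num_threads

    def start_bundle(self):
        self.pool = ThreadPoolExecutor(max_workers=self.num_threads)
        self.futures = []

    def process(self, element):
        # Submitting is cheap; the blocking call runs on a pool thread.
        self.futures.append(self.pool.submit(call_api, element))

    def finish_bundle(self):
        # Collect results on the calling thread, in input order.
        results = [f.result() for f in self.futures]
        self.pool.shutdown()
        return results
```

With this sketch, `start_bundle()`, a series of `process(element)` calls, and a final `finish_bundle()` return the transformed elements while the slow calls overlap on the pool.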

Here is an example of a DoFn that processes multiple items concurrently:

  import queue
  import threading

  import apache_beam as beam

  class MultiThreadedDoFn(beam.DoFn):
    def __init__(self, func, num_threads=10):
      self.func = func
      self.num_threads = num_threads

    def setup(self):
      self.done = False
      self.input_queue = queue.Queue(2)
      self.output_queue = queue.Queue()
      self.threads = [
          threading.Thread(target=self.work, daemon=True)
          for _ in range(self.num_threads)]
      for t in self.threads:
        t.start()

    def work(self):
      while not self.done:
        try:
          windowed_value = self.input_queue.get(timeout=0.1)
          self.output_queue.put(
              windowed_value.with_value(self.func(windowed_value.value)))
        except queue.Empty:
          pass  # check self.done

    def start_bundle(self):
      self.pending = 0

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam):
      self.pending += 1
      self.input_queue.put(
          beam.transforms.window.WindowedValue(
              element, timestamp, (window,)))
      try:
        while not self.output_queue.empty():
          yield self.output_queue.get(block=False)
          self.pending -= 1
      except queue.Empty:
        pass

    def finish_bundle(self):
      while self.pending > 0:
        yield self.output_queue.get()
        self.pending -= 1

    def teardown(self):
      self.done = True
      for t in self.threads:
        t.join()

It can be used as:

  import logging
  import time

  def func(n):
    time.sleep(n / 10)
    return n + 1

  with beam.Pipeline() as p:
    (p
     | beam.Create([1, 3, 5, 7] * 10 + [9])
     | beam.ParDo(MultiThreadedDoFn(func))
     | beam.Map(logging.error))
robertwb