
In my pipeline I need to make a single RPC call as well as a batched RPC call to fetch data for enrichment. I could not find any reference on how to make these calls within a pipeline. I am still finding my feet with Apache Beam and would appreciate it if anyone who has done this could share sample code or details on how to do it.

Thanks.

1 Answer


Setting up the answer

I will help by writing example DoFns that do the things you want. I will write them in Python, but the Java code would be similar.

Let's suppose we have two functions that internally perform an RPC:

def perform_rpc(client, element):
  ...  # Some code to run one RPC for the element using the client

def perform_batched_rpc(client, elements):
  ...  # Some code that runs a single RPC for a batch of elements using the client

Let's also suppose that you have a function create_client() that returns a client for your external system. We assume that creating this client is somewhat expensive, and that it is not possible to maintain many clients in a single worker (due to, e.g., memory constraints).
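
For concreteness, a hypothetical create_client might look like the sketch below. The requests-based HTTP session and the placeholder auth header are my illustrative assumptions, not part of the original answer; substitute whatever client library your external system actually uses.

import requests

def create_client():
  # Hypothetical: an HTTP session pointed at your enrichment service.
  # Building the session (connection pools, auth, TLS setup) is the
  # "somewhat expensive" part we want to do only once per DoFn instance.
  session = requests.Session()
  session.headers.update({"Authorization": "Bearer <token>"})  # placeholder
  return session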

Performing a single RPC per element

It is usually fine to perform a blocking RPC for each element, though this may lead to low CPU utilization:

import apache_beam as beam

class IndividualBlockingRpc(beam.DoFn):

  def setup(self):
    # Create the client only once per DoFn instance
    self.client = create_client()

  def process(self, element):
    # Emit the RPC result so downstream transforms receive the enriched element
    yield perform_rpc(self.client, element)

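To wire this into a pipeline, apply it with a plain ParDo. The source below is just a stand-in for your real input:

with beam.Pipeline() as p:
  enriched = (p
              | beam.Create(['element-1', 'element-2'])  # stand-in source
              | beam.ParDo(IndividualBlockingRpc()))
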
If you would like to be more sophisticated, you could also try to run asynchronous RPCs by buffering. Note that in this case your client would need to be thread-safe:

from concurrent.futures import ThreadPoolExecutor

MAX_BUFFER_SIZE = 100  # Tune for your workload

class AsyncRpcs(beam.DoFn):
  def __init__(self):
    self.buffer = []
    self.client = None
    self.executor = None

  def process(self, element):
    self.buffer.append(element)
    if len(self.buffer) > MAX_BUFFER_SIZE:
      self._flush()

  def finish_bundle(self):
    # Flush any elements still buffered when the bundle ends
    self._flush()

  def _flush(self):
    if not self.client:
      self.client = create_client()
    if not self.executor:
      self.executor = ThreadPoolExecutor()  # Use a configured executor

    futures = [self.executor.submit(perform_rpc, self.client, elm)
               for elm in self.buffer]
    for f in futures:
      f.result()  # Wait for all the futures to complete
    self.buffer = []

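One caveat on this asynchronous version: as written, _flush waits for every future but discards the results, so nothing is emitted downstream. If you need the RPC responses for enrichment, you would have to collect each f.result() and yield it; note that emitting from finish_bundle in the Python SDK requires wrapping outputs as windowed values, which adds some complexity.
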
Performing a single RPC for a batch of elements

For most runners, a batch pipeline has large bundles. This means that it makes sense to simply buffer elements as they arrive in process, and flush them periodically, like so:

class BatchAndRpc(beam.DoFn):
  def __init__(self):
    self.buffer = []
    self.client = None

  def process(self, element):
    self.buffer.append(element)
    if len(self.buffer) > MAX_BUFFER_SIZE:
      self._flush()

  def finish_bundle(self):
    # Flush any elements still buffered when the bundle ends
    self._flush()

  def _flush(self):
    if not self.client:
      self.client = create_client()
    perform_batched_rpc(self.client, self.buffer)
    self.buffer = []

For streaming pipelines, or for pipelines where your bundles are not large enough for this strategy to work well, you may need to try other tricks, but this strategy should be enough for most scenarios.

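One such trick (my addition, not from the original answer; check that your SDK version and runner support it) is Beam's built-in GroupIntoBatches transform, which buffers elements per key using state and timers and therefore also works in streaming. A minimal sketch, assuming elements can be assigned an arbitrary shard key:

NUM_SHARDS = 10  # illustrative; bounds how many batches are assembled in parallel

class RpcPerBatch(beam.DoFn):
  def setup(self):
    self.client = create_client()

  def process(self, keyed_batch):
    _, batch = keyed_batch
    yield perform_batched_rpc(self.client, batch)

batches = (elements  # 'elements' stands in for your input PCollection
           | beam.Map(lambda e: (hash(e) % NUM_SHARDS, e))  # GroupIntoBatches requires keyed input
           | beam.GroupIntoBatches(MAX_BUFFER_SIZE)
           | beam.ParDo(RpcPerBatch()))
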
If these strategies don't work, please let me know, and I'll detail others.

Pablo
  • Thanks for the detailed write-up. My pipeline uses the Java SDK, so I will try to convert this to Java. Could you please clarify two points? 1. How do the asynchronous single RPC and the batched RPC differ? The two look similar to an extent in how they process requests. 2. What should my batched_rpc look like inside? Should my REST API (which I will call from batched_rpc) itself be able to accept multiple requests at once, or does Beam handle that part? – Praveen Viswanathan Jul 09 '20 at 20:53