Setting up the answer
I will help by writing example DoFns that do the things you want. I will write them in Python, but the Java code would be similar.
Let's suppose we have two functions that internally perform an RPC:
def perform_rpc(client, element):
    ...  # Some code to run one RPC for the element using the client

def perform_batched_rpc(client, elements):
    ...  # Some code that runs a single RPC for a batch of elements using the client
Let's also suppose that you have a function create_client() that returns a client for your external system. We assume that creating this client is somewhat expensive, and that it is not possible to maintain many clients in a single worker (e.g. due to memory constraints).
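Just so the later snippets are concrete, here is the kind of thing create_client() might look like. Everything in it is a made-up placeholder (my_service_sdk, the endpoint, the timeout) standing in for whatever your external system's library actually exposes:

def create_client():
    # Hypothetical client construction: connect, authenticate, set timeouts, etc.
    return my_service_sdk.connect(  # placeholder SDK call, not a real library
        endpoint="https://my-external-service.example.com",
        timeout_seconds=30)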
Performing a single RPC per element
It is usually fine to perform a blocking RPC for each element, but this may lead to low CPU usage, since workers spend much of their time waiting on the network:
class IndividualBlockingRpc(DoFn):

    def setup(self):
        # Create the client only once per DoFn instance
        self.client = create_client()

    def process(self, element):
        perform_rpc(self.client, element)
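You apply this like any other DoFn; for example (Create and the sample elements here are just placeholder input):

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(['element1', 'element2', 'element3'])  # Placeholder input
     | beam.ParDo(IndividualBlockingRpc()))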
If you would like to be more sophisticated, you could also try to run asynchronous RPCs by buffering elements and submitting them to a thread pool. Keep in mind that in this case your client would need to be thread-safe:
from concurrent.futures import ThreadPoolExecutor

class AsyncRpcs(DoFn):

    def __init__(self):
        self.buffer = []
        self.client = None
        self.executor = None

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) > MAX_BUFFER_SIZE:
            self._flush()

    def finish_bundle(self):
        self._flush()

    def _flush(self):
        if not self.buffer:
            return
        if not self.client:
            self.client = create_client()
        if not self.executor:
            self.executor = ThreadPoolExecutor()  # Use a configured executor
        futures = [self.executor.submit(perform_rpc, self.client, elm)
                   for elm in self.buffer]
        for f in futures:
            f.result()  # Wait for all the RPCs to complete (and surface any errors)
        self.buffer = []
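If your client or executor holds resources worth releasing, you could also add a teardown method to AsyncRpcs along these lines. The close() call is an assumption; substitute whatever cleanup your client library actually provides:

    def teardown(self):
        # Called when the DoFn instance is discarded; best-effort cleanup
        if self.executor:
            self.executor.shutdown(wait=True)
        if self.client and hasattr(self.client, "close"):
            self.client.close()  # Assumes the client exposes close(); adjust for your SDK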
Performing a single RPC for a batch of elements
For most runners, a batch pipeline has large bundles. This means that it makes sense to simply buffer elements as they come into process, and flush them every now and then, like so:
class BatchAndRpc(DoFn):

    def __init__(self):
        self.buffer = []
        self.client = None

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) > MAX_BUFFER_SIZE:
            self._flush()

    def finish_bundle(self):
        self._flush()

    def _flush(self):
        if not self.buffer:
            return  # Nothing to send
        if not self.client:
            self.client = create_client()
        perform_batched_rpc(self.client, self.buffer)
        self.buffer = []
For streaming pipelines, or for pipelines where your bundles are not large enough for this strategy to work well, you may need to try other tricks, but this strategy should be enough for most scenarios.
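For instance, one trick that sometimes helps in streaming is Beam's GroupIntoBatches transform, which batches elements per key across bundles using state and timers (so it needs a runner that supports those). A rough sketch, assuming it is acceptable to assign elements an arbitrary key:

import apache_beam as beam

class BatchedRpcOnGroups(beam.DoFn):
    def setup(self):
        self.client = create_client()

    def process(self, keyed_batch):
        _key, elements = keyed_batch
        perform_batched_rpc(self.client, list(elements))

with beam.Pipeline() as p:
    (p
     | beam.Create(['element1', 'element2', 'element3'])  # Placeholder input
     | beam.Map(lambda elm: (0, elm))  # Single dummy key; pick real keys to preserve parallelism
     | beam.GroupIntoBatches(MAX_BUFFER_SIZE)
     | beam.ParDo(BatchedRpcOnGroups()))

Note that funneling everything onto one key serializes the batching, so in practice you would choose keys that keep enough parallelism for your load.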
If these strategies don't work, please let me know, and I'll detail others.