I have a Beam pipeline that queries BigQuery and then uploads the results to BigTable. I'd like to scale out my BigTable instance (from 1 to 10 nodes) before my pipeline starts and then scale back down (from 10 to 1 node) after the results are loaded into BigTable. Is there any mechanism to do this with Beam?
I'd essentially like to have either two separate transforms, one at the beginning of the pipeline and one at the end, that scale the nodes up and down respectively, or a DoFn that only triggers setup() and teardown() on a single worker.
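Roughly, the pipeline shape I'm imagining is something like the sketch below. The ScaleBigTableUp / ScaleBigTableDown transforms are hypothetical placeholders for whatever the right mechanism turns out to be, and pipeline_options, query, and the id variables stand in for my real options:

import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        # | 'Scale up BigTable' >> ScaleBigTableUp(node_count=10)     # hypothetical: should run exactly once, before the load
        | 'Read from BigQuery' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
        | 'Write to BigTable' >> beam.ParDo(_BigTableWriteFn(project_id, instance_id, table_id, cluster_id, node_count))
        # | 'Scale down BigTable' >> ScaleBigTableDown(node_count=1)  # hypothetical: should run exactly once, after the load
    )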
I've attempted to use the setup() and teardown() DoFn lifecycle methods. But these functions get executed once per worker (and I use hundreds of workers), so the pipeline tries to scale BigTable up and down many times (and hits the instance and cluster write quotas for the day). So that doesn't really work for my use case. In any case, here's a snippet of the _BigTableWriteFn I've been experimenting with:
import apache_beam as beam
from apache_beam.metrics import Metrics
from google.cloud.bigtable import Client


class _BigTableWriteFn(beam.DoFn):
    def __init__(self, project_id, instance_id, table_id, cluster_id, node_count):
        beam.DoFn.__init__(self)
        self.beam_options = {
            'project_id': project_id,
            'instance_id': instance_id,
            'table_id': table_id,
            'cluster_id': cluster_id,
            'node_count': node_count
        }
        self.table = None
        self.initial_node_count = None
        self.batcher = None
        self.written = Metrics.counter(self.__class__, 'Written Row')

    def setup(self):
        client = Client(project=self.beam_options['project_id'].get(), admin=True)
        instance = client.instance(self.beam_options['instance_id'].get())
        node_count = self.beam_options['node_count'].get()
        cluster = instance.cluster(self.beam_options['cluster_id'].get())
        self.initial_node_count = cluster.serve_nodes
        # I realize this logic is flawed since cluster.serve_nodes will change after the
        # first setup() call, but I first thought setup() and teardown() were run once
        # for the whole transform...
        if node_count != self.initial_node_count:
            cluster.serve_nodes = node_count
            cluster.update()

    ## other life cycle methods in between but aren't important to the question

    def teardown(self):
        client = Client(project=self.beam_options['project_id'].get(), admin=True)
        instance = client.instance(self.beam_options['instance_id'].get())
        cluster = instance.cluster(self.beam_options['cluster_id'].get())
        # same flawed logic as in setup()
        if cluster.serve_nodes != self.initial_node_count:
            cluster.serve_nodes = self.initial_node_count
            cluster.update()
I'm also using RuntimeValueProvider parameters for the BigTable ids (project_id, instance_id, cluster_id, etc.), so I feel that whatever transform I use to do the scaling will need to be a DoFn.
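For context, this is roughly how the ValueProvider parameters are defined and handed to the DoFn; the option and class names here are placeholders for my actual template parameters:

from apache_beam.options.pipeline_options import PipelineOptions

class BigTableLoadOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Left unset at template build time, each of these becomes a RuntimeValueProvider.
        parser.add_value_provider_argument('--project_id', type=str)
        parser.add_value_provider_argument('--instance_id', type=str)
        parser.add_value_provider_argument('--table_id', type=str)
        parser.add_value_provider_argument('--cluster_id', type=str)
        parser.add_value_provider_argument('--node_count', type=int)

options = pipeline_options.view_as(BigTableLoadOptions)
write_fn = _BigTableWriteFn(options.project_id, options.instance_id, options.table_id,
                            options.cluster_id, options.node_count)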
Any help would be much appreciated!