
I have a very simple streaming pipeline that reads from Pub/Sub, runs inference on a TensorFlow model, and then writes the result back to Pub/Sub:

    import apache_beam as beam
    from apache_beam.ml.inference.base import RunInference

    # ModelHandler is my custom handler that loads the TensorFlow model and runs inference.
    with beam.Pipeline(options=pipeline_options) as pipeline:
        _ = (
            pipeline
            | 'PSRead' >> beam.io.ReadFromPubSub(
                subscription=read_subscription_name,
                with_attributes=True,
                id_label='message_id'
            )
            | 'RunModel' >> RunInference(ModelHandler())
            | 'PSWrite' >> beam.io.WriteToPubSub(write_topic_name, with_attributes=True)
        )

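For reference, I'm launching on Dataflow with pipeline options along these lines (simplified sketch; the project, region, and bucket values here are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Simplified Dataflow options; project/region/temp_location are placeholders.
    pipeline_options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
        streaming=True,
        max_num_workers=10,
        autoscaling_algorithm='THROUGHPUT_BASED',
    )
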
Ideally, this pipeline would leverage Dataflow's autoscaling feature to act as a simple, scalable work queue: when there is a backlog of items to run inference on, add more workers, run a copy of this entire pipeline against the same subscription on each one, compete for work items until the queue is empty, and then scale back down. However, I cannot get this to autoscale up from 1 worker at all, and I'm wondering how I should expect this to work with beam.io.ReadFromPubSub as my source.

The documentation for both Dataflow and Beam is pretty unclear on this, but I think I'm supposed to be assigning keys somehow to the messages that come out of beam.io.ReadFromPubSub (because keys determine parallelism? I really don't understand this...). If that's the case, how do I do that? Is each message already its own key? Will the actual beam.io.ReadFromPubSub connector scale out to each worker, or should I expect one instance of the connector and only the RunModel step to scale across multiple workers?
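
If assigning keys is indeed the answer, is something like the following what's expected? This is just a guess on my part: `messages` stands for the output of the ReadFromPubSub step above, and the choice of key is arbitrary.

    # Guess 1: key each message by an attribute before inference,
    # so that work can be spread across keys.
    keyed = messages | 'KeyByMsgId' >> beam.WithKeys(
        lambda msg: msg.attributes.get('message_id', ''))

    # Guess 2: or is it about breaking fusion between the read and the
    # inference steps, e.g. with a Reshuffle?
    redistributed = messages | 'Redistribute' >> beam.Reshuffle()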

  • Do [link1](https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#efficient_deduplication), [link2](https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#exactly-once_delivery), and [link3](https://stackoverflow.com/questions/51088518/) help you? – kiran mathew May 03 '23 at 14:33
  • Thanks for the comment, but no, those aren't helpful. I'm not having issues with deduplication or the message backlog. I really want to understand whether it's even possible to get the actual ReadFromPubSub part of the pipeline to scale, or if it always puts one instance of that on one worker and only tries to scale the rest of the pipeline. If it is possible, how do I actually get it to scale? – Rob Allsopp May 06 '23 at 14:43
  • Hi @Rob Allsopp, based on my understanding, if a specific part of your pipeline is computationally heavier than others, the Dataflow service may automatically spin up additional workers during those phases of your job. For more information about scaling you can check this [link1](https://medium.com/@raigonjolly/dataflow-for-google-cloud-professional-data-exam-9efd59377068) and [link2](https://cloud.google.com/dataflow/docs/horizontal-autoscaling#streaming). – kiran mathew May 12 '23 at 11:38

0 Answers