
We have a streaming pipeline with autoscaling enabled. Generally, one worker is enough to process the incoming data, but we want the number of workers to increase automatically if a backlog builds up.

Our pipeline reads from Pub/Sub and writes batches to BigQuery using load jobs every 3 minutes. We ran this pipeline starting with one worker, publishing twice as much data to Pub/Sub as one worker could consume. After 2 hours, autoscaling had still not kicked in, so the backlog would have been about 1 hour's worth of data. This seems rather poor given that autoscaling aims to keep the backlog under 10 seconds (according to this SO answer).
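
For concreteness, here is a minimal sketch of this kind of pipeline (not our exact code; the subscription, table, and the payload-to-row mapping are placeholders):

```java
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class PubsubToBigQuery {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubsub",
            PubsubIO.readMessages()
                .fromSubscription("projects/my-project/subscriptions/my-sub")) // placeholder
        .apply("ToTableRows",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(msg -> new TableRow()
                    .set("payload", new String(msg.getPayload(), StandardCharsets.UTF_8))))
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // placeholder
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // batch load jobs fire roughly every 3 minutes
                .withTriggeringFrequency(Duration.standardMinutes(3))
                .withNumFileShards(1)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```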

The document here says that autoscaling for streaming jobs is in beta, and is known to be coarse-grained if the sinks are high-latency. And yeah, I guess doing BigQuery batches every 3 minutes counts as high-latency! Is there any progress being made on improving this autoscaling algorithm?

Are there any work-arounds we can do in the meantime, such as measuring throughput at a different point in the pipeline? I couldn't find any documentation on how the throughput gets reported to the autoscaling system.

Chris Heath

1 Answer


The backlog is created by unacknowledged messages; I assume you are using a pull subscription. If a message takes longer to process than the acknowledgement deadline allows, it is redelivered, per Pub/Sub's at-least-once delivery, and the only worker able to process that message is the first one that received it. No new instance will be created in this case.

What you need to do is tune your system so that messages are processed before the acknowledgement deadline expires. You may benefit from using push subscriptions in some situations. I recommend reviewing this document regarding the backlog created by Pub/Sub.
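
If you do decide to raise the subscription's acknowledgement deadline, a minimal sketch using the google-cloud-pubsub admin client might look like this (project and subscription names are placeholders, and 200 seconds is only an example value; whether this helps under Dataflow's managed acknowledgements is discussed in the comments below):

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class RaiseAckDeadline {
  public static void main(String[] args) throws Exception {
    // Placeholders: substitute your own project and subscription.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-sub");

    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
      Subscription updated =
          Subscription.newBuilder()
              .setName(subscription.toString())
              .setAckDeadlineSeconds(200) // up from the 10-second default (max is 600)
              .build();
      UpdateSubscriptionRequest request =
          UpdateSubscriptionRequest.newBuilder()
              .setSubscription(updated)
              .setUpdateMask(
                  FieldMask.newBuilder().addPaths("ack_deadline_seconds").build())
              .build();
      client.updateSubscription(request);
    }
  }
}
```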

Nathan Nasser
  • Does that document about pub/sub backlog apply to Dataflow PubsubIO? I thought the Dataflow runner takes care of the acknowledgements. (And I hope it doesn't acknowledge them in 10MB bundles as the document would imply!) – Chris Heath Jul 06 '18 at 19:29
  • That document is about pull subscriptions. Since you mentioned a backlog, I assume you are using pull subscriptions and not push (correct me if I'm wrong). The Dataflow runner takes care of acknowledging the messages, but if processing takes longer than the acknowledgement deadline, the message is sent again and that creates a backlog. – Nathan Nasser Jul 09 '18 at 22:55
  • Yes, using pull. We are publishing 1 KB messages at 200 msgs/sec. One worker can process 100 msgs/sec. After 100 seconds, the 10MB user-space buffer would be full, and it would take 100 secs to clear it. I assume it would be a good idea to increase our ack deadline to 200s to avoid data being re-sent. (Dataflow would deduplicate them, but that's just a waste of resources.) But would this improve the autoscaling? My goal is to make it automatically spawn a second worker so that it can process 200 msgs/sec. Does the output side (BigQuery batch loads) also affect the autoscaling calculation? – Chris Heath Jul 11 '18 at 02:25
  • The key signals for autoscaling are CPU utilization, throughput and backlog. The source (Pub/Sub in this case) needs to inform the Cloud Dataflow service about the backlog via getSplitBacklogBytes() or getTotalBacklogBytes(). As this is in beta, autoscaling works smoothest when reading from Cloud Pub/Sub subscriptions tied to topics that are published to in small batches. Changing the ack deadline will certainly reduce the backlog. Review this [link](https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling) (the same one you shared) regarding streaming autoscaling. – Nathan Nasser Jul 12 '18 at 17:32
  • We are publishing in small batches, but that document also says it works smoothest when the sink is smooth, so somehow the sink factors into the algorithm. Using my example 2 comments above: after 150 seconds, 30MB will have been published, 15MB processed, 10MB will be in the user-space pub-sub queue, 5MB will be queued in the subscription, and zero MB will be in BigQuery (because our BQ batches run every 3 minutes). I think the backlog would be 5MB, but I'm not sure what values getSplitBacklogBytes() or getTotalBacklogBytes() would report. Would throughput be 0 bytes/sec or 100kB/sec? – Chris Heath Jul 13 '18 at 17:19
  • The getSplitBacklogBytes() and getTotalBacklogBytes() methods return a long that grows with the size of incoming messages and shrinks as messages are processed (sketched below). [This](https://www.javatips.net/api/beam-master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#) is the best example I found online. Here is the [documentation](https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/UnboundedSource.UnboundedReader.html#getSplitBacklogBytes--). – Nathan Nasser Jul 16 '18 at 17:11
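
For illustration, here is a rough, hypothetical sketch of how an UnboundedReader reports its backlog by overriding getSplitBacklogBytes(). This is not the actual PubsubUnboundedSource implementation; fetchNextMessage() and the bytesPublished bookkeeping are placeholders for however a real source tracks what has been published and consumed:

```java
import java.io.IOException;
import org.apache.beam.sdk.io.UnboundedSource;

// Not the real PubsubUnboundedSource; just the shape of the backlog hook.
// fetchNextMessage() and the bytesPublished bookkeeping are hypothetical.
abstract class BacklogReportingReader extends UnboundedSource.UnboundedReader<byte[]> {
  private long bytesPublished = 0; // bytes known to be waiting at the source
  private long bytesConsumed = 0;  // bytes this reader has handed to the pipeline
  private byte[] current;          // would be returned by getCurrent()

  @Override
  public boolean advance() throws IOException {
    current = fetchNextMessage(); // hypothetical fetch from the source
    if (current == null) {
      return false; // nothing available right now
    }
    bytesConsumed += current.length;
    return true;
  }

  // One of the signals the Dataflow service reads when deciding whether to scale.
  @Override
  public long getSplitBacklogBytes() {
    long backlog = bytesPublished - bytesConsumed;
    return backlog >= 0 ? backlog : UnboundedSource.UnboundedReader.BACKLOG_UNKNOWN;
  }

  // start(), getCurrent(), getWatermark(), getCheckpointMark(), close(), etc.
  // are omitted here; a real reader must implement them.
  protected abstract byte[] fetchNextMessage() throws IOException;
}
```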