I am new to GCP, and while reading the documentation about auto-tuning by the Dataflow service, I see it talks about a backlog and autoscaling that depends on it. In this context, what is the backlog? If my pipeline is reading from Pub/Sub, is it the age of the oldest message or the number of unacknowledged messages?
The backlog in Dataflow isn't the Pub/Sub backlog. Dataflow pulls messages from Pub/Sub as soon as they arrive, but the processing queue can grow internally in Dataflow: that queue is the backlog. If it gets too big and the CPU consumption is too high, a new worker is added to the pipeline.
In streaming mode you still have this backlog, but you also have a predictive backlog: Dataflow compares the number of messages in successive time windows, and if the count is increasing, that can be the beginning of a spike, so Dataflow can scale up proactively.
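For illustration, here is a minimal sketch of a streaming Beam pipeline reading from Pub/Sub with throughput-based autoscaling enabled, which is what lets Dataflow add workers when the backlog and CPU usage grow. The project, bucket, and subscription names are placeholders, not real resources:

    # Minimal streaming pipeline sketch; the project, bucket, and
    # subscription names are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        autoscaling_algorithm="THROUGHPUT_BASED",  # scale on backlog and CPU
        max_num_workers=10,                        # upper bound for autoscaling
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        )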

guillaume blaquiere
@guillaume_blaquiere Thanks for the explanation. I understood everything except the second sentence. What do you mean by "Dataflow pulls messages from Pub/Sub as soon as they arrive"? – lookout May 28 '21 at 14:32
Dataflow creates a pull connection to Pub/Sub and gets the messages immediately. There is no backlog in the Pub/Sub subscription; the subscription is normally empty. – guillaume blaquiere May 28 '21 at 15:13
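If you want to verify that the subscription stays nearly empty while Dataflow is consuming it, you can read its num_undelivered_messages metric with the Cloud Monitoring client. A minimal sketch, assuming the google-cloud-monitoring package is installed; "my-project" is a placeholder project ID:

    # Sketch: print the unacked-message count for each Pub/Sub subscription
    # over the last 5 minutes; "my-project" is a placeholder.
    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
    )
    results = client.list_time_series(
        request={
            "name": "projects/my-project",
            "filter": 'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        # The value should hover near 0 for a subscription that a
        # Dataflow pipeline is actively pulling from.
        print(series.resource.labels["subscription_id"],
              series.points[0].value.int64_value)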