
Nutshell

There is a Google Cloud PubSub topic to which we publish 2M tiny/small messages. We set up a Cloud Function to process messages coming through this topic. When inspecting the logs, we see that a lot of these messages are processed multiple times. All in all, 150-200% of the number of messages that went in come out (see screenshot at the end).

The questions are: Why is this? And how do we properly configure things to get fewer duplicates?

Additional information

  • Right after the 2M messages are published and the script doing so is finished, the number of messages in the subscription appears to be correct. I think we can thus rule out "publish-side" duplication, marked as 1 and 2 in this article.
  • The config for the subscription is managed by Cloud Functions and cannot be edited by us. So it's the default/forced config for all Cloud Functions everywhere, I guess. See screenshot below.
  • The most common reason cited for duplication on the "subscriber-side" is functions taking longer than the Ack Deadline to acknowledge messages (e.g. here). The Ack Deadline on the subscription is set to 600s, but the function run time tops out at a couple of seconds max.
  • We also don't batch at all, which eliminates another reason often cited (e.g. in this question or this article).
  • The function is limited to 90 instances.

Update 15.03.22

So I wanted to be absolutely certain this is "true" duplication, by which I mean it's PubSub delivering messages/events multiple times, rather than duplication clumsily introduced by us somewhere. So I modified our function code to form a Datastore key from our external message ID and context.messageId and write it to Datastore with a counter; if the key already exists, the counter is incremented. Right afterwards I log the entity (a rough sketch of this counting logic follows the table below). Here are the stats for 458,940 executions:

|-------------------|-------------------|
|      Counter      |       Logs        |
|-------------------|-------------------|
|         1         |       208,733     |
|         2         |       101,040     |
|         3         |       62,965      |
|         4         |       37,156      |
|         5         |       20,583      |
|         >5        |       28,463      |
|         >15       |       20          |
|         >20       |       0           |
|-------------------|-------------------|
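
For illustration, the counting logic is roughly the following sketch (not our literal production code; the kind name `DeliveryCount`, the `externalId` attribute, and the use of `context.eventId` as the message ID are placeholders/assumptions):

```typescript
import {Datastore} from '@google-cloud/datastore';

const datastore = new Datastore();

// Legacy Pub/Sub-triggered background function: (message, context).
export async function countDeliveries(
  message: {attributes?: Record<string, string>},
  context: {eventId: string}
): Promise<void> {
  // Our external message ID travels as a message attribute here (placeholder name).
  const externalId = message.attributes?.externalId ?? 'unknown';

  // Key = external ID + Pub/Sub message ID, so every redelivery of the
  // same message maps to the same Datastore entity.
  const key = datastore.key(['DeliveryCount', `${externalId}:${context.eventId}`]);

  // Transactional create-or-increment of the counter.
  const tx = datastore.transaction();
  await tx.run();
  try {
    const [entity] = await tx.get(key);
    const counter = entity ? entity.counter + 1 : 1;
    tx.save({key, data: {counter}});
    await tx.commit();
    // Log the entity; counter > 1 means this execution is a redelivery.
    console.log(JSON.stringify({key: key.name, counter}));
  } catch (err) {
    await tx.rollback();
    throw err;
  }
}
```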

My only (crummy) lead

The only theory I have right now is that the duplication is due to the underlying infrastructure responding with 429s because of the instance limit. See the following screenshot.

[screenshot]

We don't really notice these anymore because we just filter the corresponding log messages out in the logging console by filtering for log_name, since the infrastructure throwing lots of warnings and errors seems to be expected behaviour. That is also why I am skeptical of this being the reason for the duplication. On the other hand, it does look like the infrastructure could send NACKs in these cases.

It feels like the expected behaviour would be that the delivery of the original message fails with a 429 (and therefore does not show up in our logs) and the message is then re-delivered, possibly resulting in a duplicate. But that is not what we are observing: our logs do show duplicate executions.

So I am not sure this is a promising lead at all, but it's all I got right now.

Other Resources

I feel like what we are observing sounds similar to what is described in this question and the docs here, but those are about "StreamingPull" subscriptions, whereas the GCF-managed subscriptions seem to be (special?) "Push" subscriptions. I was excited at first, because it sounded so similar, but it seems it's not applicable.

Screenshots

Screenshot showing message duplication between topics:

[screenshot]

Subscription config (managed by GCF):

[screenshot]

Wilhelm Uschtrin

1 Answer


The 429 errors are happening because there are too many requests or too much traffic for the available instances. Also, your maximum instances setting is 90, while the suggested default is 3000.

Additionally, what you can do is clear the maximum instances setting, or set it to 0 if you don't want any limit. Alternatively, you can raise the maximum instance number from 90 to 180, and eventually a bit higher, to resolve your problems with the 429 errors.

Other things that can cause this duplication behavior in your subscription are:

  • Messages in the subscriber’s queue are not being processed and ACK'd within the deadline;

  • The client library continues to (automatically) extend the deadline hoping to be able to process these messages eventually;

  • At some point, after several deadline extensions the client library gives up, so those messages are dropped and redelivered;

  • Messages in Pub/Sub are delivered in batches, so not only are the expired messages redelivered, but also the other messages belonging to their batches;

  • The subscriber will have its queue full of duplicates, which slows down backlog consumption;

  • The problem becomes worse and worse since current duplicates will expire as well and generate new duplicates.
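
Since the GCF-managed push subscription itself cannot be reconfigured, the practical mitigation the comments below converge on is to make the function idempotent, e.g. by recording already-processed message IDs. A minimal sketch of that deduplication approach, assuming Datastore as the dedup store and a hypothetical `processMessage()` standing in for the real work:

```typescript
import {Datastore} from '@google-cloud/datastore';

const datastore = new Datastore();

// Hypothetical stand-in for the function's real work (e.g. calling the downstream API).
async function processMessage(payload: string): Promise<void> {
  console.log(`processing: ${payload}`);
}

// Pub/Sub-triggered background function that skips message IDs it has already seen.
export async function dedupedHandler(
  message: {data?: string},
  context: {eventId: string}
): Promise<void> {
  const seenKey = datastore.key(['ProcessedMessage', context.eventId]);

  // Transactionally record the message ID; if it already exists,
  // this execution is a redelivery and can be acknowledged without doing the work.
  const tx = datastore.transaction();
  await tx.run();
  const [seen] = await tx.get(seenKey);
  if (seen) {
    await tx.rollback();
    console.log(`Skipping duplicate delivery ${context.eventId}`);
    return; // returning without error acknowledges the message
  }
  tx.save({key: seenKey, data: {processedAt: new Date()}});
  await tx.commit();

  const payload = message.data
    ? Buffer.from(message.data, 'base64').toString()
    : '';
  await processMessage(payload);
}
```

Note the trade-off in this sketch: recording the ID before processing means a delivery that fails mid-processing will not be retried effectively, while recording it after processing would leave a small window for concurrent duplicates instead.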

Jose Gutierrez Paliza
  • Hey, thanks for the reply. I agree this sounds like what we are seeing, and it is also similar to the StreamingPull docs and question I linked to. But we are subscribed to the PubSub topic via a Push subscription managed by GCF. So we can neither modify the subscription settings, nor the subscriber client config. We are just served events one by one in our functions and that's it... unless I am missing something. Of course it could be that GCF uses a Pull subscription under the covers, but then how are we supposed to configure it properly? – Wilhelm Uschtrin Mar 15 '22 at 18:07
  • For pull subscriptions there are different types of pull that you can do; please check this [documentation](https://cloud.google.com/pubsub/docs/pull). My recommendation for dealing with the duplicates on the pull side is to use Cloud Monitoring to monitor the acknowledged operations. To get this data, you need to select the `subscription/expired_ack_deadlines_count` metric. And to reduce the duplication rate, extend the message deadline through the client libraries. – Jose Gutierrez Paliza Mar 16 '22 at 16:02
  • Ah sorry, I am afraid I am not being clear enough: Unless I am missing something big, we can't switch to PULL subscription. We are using GCF, which creates a PUSH subscription, that is fully managed (read-only). I also can't find the metric you are referencing, these are the ones available to us in Cloud Monitoring: https://gist.github.com/uschtwill/0095d8080d78074ed028831d1c358284. Maybe it doesn't show because we don't have any PULL subscriptions at all? – Wilhelm Uschtrin Mar 16 '22 at 16:47
  • Yeah, you are correct. It doesn't show it because you don't have any PULL subscriptions. Internally Pub/Sub batches messages even when the published messages are unbatched. So, if one of your [GCF pushes fails](https://cloud.google.com/pubsub/docs/push#push_backoff) it could cause many other messages to be resent. Additionally, it is up to your GCF request to be [idempotent](https://developer.mozilla.org/en-US/docs/Glossary/Idempotent). – Jose Gutierrez Paliza Mar 17 '22 at 19:18
  • Are you sure regarding the batching? This could match my theory from the post, that it is due to the 429s. PubSub batches some messages, one (or more) of them fails with a 429 and the whole batch is redelivered -> duplicates. But how is this supposed to work then? I don't understand. This would mean you always get insane amounts of duplicates once you limit the number of function instances and have a bit of a backlog building up. This can't be how it's supposed to be? I mean... our functions are idempotent, so we can handle it. But it's just wildly inefficient and inelegant, no? – Wilhelm Uschtrin Mar 19 '22 at 11:46
  • Yes, I’m sure about the batching. My guess is that this problem is happening through the subscription so, I’ll update my answer to let you know what are the things that could be happening with the subscription. There is nothing that you can do through your client side if the subscription is not configured correctly. – Jose Gutierrez Paliza Mar 21 '22 at 15:45
  • I see... and can you confirm two more things for me: 1) That we have no way to configure the subscription, as it is managed by GCF? And 2) that the messages are not ACKed because of the 429s happening because of the instance limit? Because 99,99999% of executions do log "Finished with status: ok"... – Wilhelm Uschtrin Mar 22 '22 at 10:35
  • Thank you for staying with me, by the way. :) – Wilhelm Uschtrin Mar 22 '22 at 10:36
  • 1) That's correct, there is no way to configure the subscription managed by GCF. 2) That is one of the possible things that can be happening and probably is the cause, but to determine whether that is correct I would have to dig into the GCP project. – Jose Gutierrez Paliza Mar 22 '22 at 20:09
  • Okay, thanks for clarifying. In that case, I really am wondering how to mitigate the duplicates "GCP-natively"? Otherwise, as it stands, it seems to me that instance limits shouldn't be used when there is the possibility of a non-trivial backlog building up. Because even with idempotence, having these amounts of duplicates doesn't seem viable. Or we roll our own de-duplication layer with Datastore and messageIds? Please advise ... and you digging into the project, how would we go about that? – Wilhelm Uschtrin Mar 23 '22 at 12:38
  • 1) The recommended way to mitigate the duplicates is with [Dataflow](https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow#:~:text=Dataflow%20receives%20messages) by using the fused stage. 2) You can roll your own deduplication layer with Datastore or, as I mentioned in point 1, use Dataflow. 3) A coworker or I can dig into your project if you create a [Support case](https://cloud.google.com/support-hub). – Jose Gutierrez Paliza Mar 23 '22 at 15:45
  • Support case opened with ID: 29586188 – Wilhelm Uschtrin Mar 24 '22 at 12:41
  • Hello Jose, thanks for shedding light on why this is happening and what we can do about it. Could you update your answer to remove the references to PULL-type subscriptions (e.g. regarding the flow control settings)? That would probably make it clearer for posterity. And a bit of feedback: it would be super helpful if the docs explicitly stated that, when long backlogs are a regular occurrence, instance limits should not be used with background/event-driven functions. Thinking about it, stating that more generally and explicitly in this answer would be awesome as well. – Wilhelm Uschtrin Mar 28 '22 at 10:31
  • We are using the instance limit as a form of rate limiting to control pressure on an API downstream, so we can not remove it. Moving the whole pipeline to Dataflow is also not feasible. So I will implement the deduplication logic in our own code, as discussed. By the way, this here is the place where I would put the heads-up in the docs: https://cloud.google.com/functions/docs/configuring/max-instances#setting_and_clearing_maximum_instance_limits. Thanks again for helping! :) – Wilhelm Uschtrin Mar 28 '22 at 10:33
  • @WilhelmUschtrin I updated the answer as you recommended, thanks. – Jose Gutierrez Paliza Apr 05 '22 at 14:31