
I'm running an analytics pipeline.

  • Throughput is ~11 messages per second.
  • My Pub/Sub topic has a backlog of around 2M messages.
  • 80 GCE instances are pulling messages in parallel.

Here is my topic and the subscription:

gcloud pubsub topics create pipeline-input

gcloud beta pubsub subscriptions create pipeline-input-sub \
    --topic pipeline-input \
    --ack-deadline 600 \
    --expiration-period never \
    --dead-letter-topic dead-letter

Here is how I pull messages:

import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

// Local buffer of messages waiting to be processed
const queue: Message[] = []

const populateQueue = async () => {
  const subscription = pubSubClient.subscription('pipeline-input-sub', {
    flowControl: {
      maxMessages: 5 // at most 5 outstanding (unacked) messages per instance
    }
  })
  // Received messages are only buffered here; they are acked/nacked later,
  // once processing finishes
  const messageHandler = async (message: Message) => {
    queue.push(message)
  }
  subscription.on('message', messageHandler)
}

const processQueueMessage = async () => {
  const message = queue.shift()
  if (message) {
    try {
      // ... process the message (~7 s) ...
      message.ack()
    } catch {
      // ... handle the failure ...
      message.nack()
    }
  }
  // re-schedule instead of recursing directly to avoid growing the stack
  setImmediate(processQueueMessage)
}

populateQueue()
processQueueMessage()

Processing time is ~7 seconds per message.

Here is one of many similar duplicate cases. The same message was delivered 5 (!!!) times, to different GCE instances:

  • 03:37:42.377
  • 03:45:20.883
  • 03:48:14.262
  • 04:01:33.848
  • 05:57:45.141

All 5 times the message was successfully processed and .ack()ed. The output contains 50% more messages than the input! I'm well aware of the "at-least-once" delivery behavior, but I expected it to duplicate something like 0.01% of messages, not 50% of them.

The topic input is 100% free of duplicates. I verified both the topic input method AND the number of un-acked messages through Cloud Monitoring. The numbers match: there are no duplicates in the Pub/Sub topic.

UPDATE:

  1. It looks like all those duplicates were created due to ack deadline expiration. I'm 100% sure that I'm acknowledging 99.9% of messages before the 600-second deadline.
stkvtflw

1 Answer


Some duplicates are expected, though a 50% duplicate rate is definitely high. The first question is: are these publish-side duplicates or subscribe-side duplicates? The former are created when a publish of the same message is retried, resulting in multiple publishes of the same message; these messages will have different message IDs. The latter are caused by redeliveries of the same message to the subscriber; these messages have the same message ID (though different ack IDs).
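A quick way to check which case you are in is to log the message ID (and ack ID) as soon as each message is received: subscribe-side duplicates share a message ID, publish-side duplicates do not. A minimal sketch against the Node.js client (the log format is just illustrative):

import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

const subscription = pubSubClient.subscription('pipeline-input-sub')

subscription.on('message', (message: Message) => {
  // Same message ID seen more than once  => subscribe-side redelivery.
  // Different message IDs, same payload  => publish-side duplicate.
  console.log(`received id=${message.id} ackId=${message.ackId}`)
  // ... hand the message off to processing as usual
})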

It sounds like you have verified that these are subscribe-side duplicates. Therefore, the likely cause, as you mention, is an expired ack deadline. The question is, why are the messages exceeding the ack deadline? One thing to note is that when using the client library, the ack deadline set in the subscription is not the one used. Instead, the client library tries to optimize ack deadlines based on client library settings and the 99th percentile ack latency. It then renews leases on messages up to the max_lease_duration property of the FlowControl object passed into the subscribe method. This defaults to one hour.

Therefore, in order for messages to remain leased, it is necessary for the client library to be able to send modifyAckDeadline requests to the server. One possible cause of duplicates would be the inability of the client to send these requests, possibly due to overload on the machine. Are the machines running this pipeline doing any other work? If so, it is possible they are overloaded in terms of CPU, memory, or network and are unable to send the modifyAckDeadline requests and unable to process messages in a timely fashion.

It is also possible that message batching could be affecting your ability to ack messages. As an optimization, the Pub/Sub system stores acknowledgements for batches of messages instead of individual messages. As a result, every message in a batch must be acked before any of them is considered acknowledged. Therefore, if you have five messages in a batch and acknowledge four of them, but then do not ack the final message, all five will be redelivered. There are some caches in place to try to minimize this, but it is still a possibility. There is a Medium post that discusses this in more detail (see the "Message Redelivery & Duplication Rate" section). It might be worth checking that all messages are acked and not nacked in your code by printing out the message ID as soon as the message is received and right before the calls to ack and nack. If your messages were published in batches, it is possible that a single nack is causing redelivery of more messages.
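Applied to your processQueueMessage, that instrumentation could look something like this (a sketch; the actual processing is elided just as in your snippet):

import { Message } from '@google-cloud/pubsub'

// Same local buffer as in the question's code
declare const queue: Message[]

const processQueueMessage = () => {
  const message = queue.shift()
  if (message) {
    try {
      // ... process the message ...
      console.log(`acking ${message.id}`)   // log right before the ack
      message.ack()
    } catch {
      console.log(`nacking ${message.id}`)  // a single nack may cause redelivery of
      message.nack()                        // every message in the same publish batch
    }
  }
  setImmediate(processQueueMessage)
}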

This coupling between batching and duplicates is something we are actively working on improving. I would expect this issue to stop at some point. In the meantime, if you have control over the publisher, you could set the max_messages property in the batch settings to 1 to prevent the batching of messages.
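If the publisher also uses the Node.js client, that setting lives under the topic's publish options as batching.maxMessages. A sketch, assuming a recent version of @google-cloud/pubsub; publishOne is just an illustrative helper:

import { PubSub } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

// Disable publish-side batching so a single nack can only ever affect one message.
// Trade-off: more publish requests, i.e. lower publisher throughput.
const topic = pubSubClient.topic('pipeline-input', {
  batching: {
    maxMessages: 1
  }
})

const publishOne = async (payload: object) => {
  const messageId = await topic.publishMessage({ data: Buffer.from(JSON.stringify(payload)) })
  console.log(`published ${messageId}`)
}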

If none of that helps, it would be best to open up a support case and provide the project name, subscription name, and message IDs of some duplicated messages. Engineers can investigate in more detail why individual messages are getting redelivered.

Kamal Aboul-Hosn
  • I'm nacking a lot of messages based on the processing result. The idea is to send them back to the queue and try to process them again later. Based on what you are saying, this is clearly the reason why I have all those duplicates: I'm nacking one message, and a bunch of other messages get nacked as a consequence. I wonder why this is part of the Pub/Sub design? – stkvtflw Oct 05 '21 at 12:25
  • It is actually something we are actively working on improving. I would expect this issue to stop at some point. In the meantime, if you have control over the publisher, you could set the [`max_messages`](https://github.com/googleapis/python-pubsub/blob/093cabff9f0464b1dfaa8f373b6fffbc439518de/google/cloud/pubsub_v1/types.py#L143) property to 1 to prevent the batching of messages. I'll update my answer with this as well. – Kamal Aboul-Hosn Oct 05 '21 at 13:36
  • I mean setting maxMessages on the publish side, not the subscribe side. Sorry, my link was to the wrong place. [This is the maxMessages you want to set](https://cloud.google.com/pubsub/docs/publisher#batching). Updated my answer with the better link. – Kamal Aboul-Hosn Oct 05 '21 at 14:15
  • Yeah, figured it out. Will it affect publisher throughput though? – stkvtflw Oct 05 '21 at 14:17
  • Looks like it will... and it will affect subscriber throughput as well... that's a pretty big trade-off. – stkvtflw Oct 05 '21 at 14:19
  • Yeah, it is unfortunately the only option right now short of republishing the messages and acking the original deliveries instead of nacking them. Like I said, we are actively working on changes to improve this and have a plan to roll it out, tentatively in Q1. – Kamal Aboul-Hosn Oct 05 '21 at 16:08
  • I think Kamal provided a very good description. I don't think I can solve the original issue, but I usually use a workaround: keeping the state of the process in a Firestore database, with a unique 'document' for every unique message. Not ideal, and it has overhead, but it works for me (I mainly work with Cloud Functions). Examples of the idea: https://stackoverflow.com/questions/66050709/google-cloud-platform-solution-for-serverless-log-ingestion-files-downloading/66050710#66050710 and https://stackoverflow.com/questions/66921764/google-pub-sub-to-control-ratelimit/66922977#66922977 – al-dann Oct 07 '21 at 16:42