I'm running an analytics pipeline.
- Throughput is ~11 messages per second.
- My Pub/Sub topic has a backlog of around 2M messages scheduled for processing.
- 80 GCE instances are pulling messages in parallel.
Here is how I create the topic and the subscription:
gcloud pubsub topics create pipeline-input
gcloud beta pubsub subscriptions create pipeline-input-sub \
--topic pipeline-input \
--ack-deadline 600 \
--expiration-period never \
--dead-letter-topic dead-letter
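For reference, the effective subscription settings (ack deadline, expiration, dead-letter policy) can be confirmed with the standard describe command:
gcloud pubsub subscriptions describe pipeline-input-sub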
Here is how I pull messages:
import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()
const queue: Message[] = []

const populateQueue = () => {
  const subscription = pubSubClient.subscription('pipeline-input-sub', {
    flowControl: {
      maxMessages: 5
    }
  })
  // Buffer each incoming message; it stays un-acked until processed below.
  const messageHandler = (message: Message) => {
    queue.push(message)
  }
  subscription.on('message', messageHandler)
}
populateQueue()
const processQueueMessage = () => {
  const message = queue.shift()
  // The queue may be empty; retry shortly instead of calling ack() on undefined.
  if (!message) {
    setTimeout(processQueueMessage, 100)
    return
  }
  try {
    // ... process the message (~7 seconds)
    message.ack()
  } catch {
    // ... handle the failure
    message.nack()
  }
  // Schedule the next iteration without growing the call stack.
  setImmediate(processQueueMessage)
}
processQueueMessage()
Processing time is ~7 seconds.
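For what it's worth, those two numbers are consistent with strictly sequential processing per instance: 80 instances × 1 message / 7 s ≈ 11.4 messages per second, which matches the observed ~11 messages per second.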
Here is one of many similar duplicate cases. The same message was delivered 5 (!!!) times, each time to a different GCE instance:
- 03:37:42.377
- 03:45:20.883
- 03:48:14.262
- 04:01:33.848
- 05:57:45.141
All 5 times the message was successfully processed and .ack()ed. As a result, the output contains 50% more messages than the input! I'm well aware of Pub/Sub's "at least once" delivery behavior, but I expected it to duplicate something like 0.01% of messages, not 50% of them.
The topic input is 100% free of duplicates. I verified this both at the point where messages are published AND against the count of unacked messages in Cloud Monitoring, and the numbers match: there are no duplicates in the Pub/Sub topic itself.
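One way to tell redeliveries apart from duplicate publishes is the Pub/Sub message ID: a redelivered message keeps the same message.id, while a message published twice gets two different IDs. A minimal sketch of that check (the in-memory map is illustrative only; in my case duplicates land on different instances, so a real check needs shared storage):

import { Message } from '@google-cloud/pubsub'

// Count deliveries per Pub/Sub message ID. A count > 1 for the same ID
// means the message was redelivered, not published twice into the topic.
const deliveryCounts = new Map<string, number>()

const trackDelivery = (message: Message) => {
  const count = (deliveryCounts.get(message.id) ?? 0) + 1
  deliveryCounts.set(message.id, count)
  if (count > 1) {
    console.warn(`message ${message.id} delivered ${count} times`)
  }
}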
UPDATE:
- It looks like all these duplicates are created by ack deadline expiration, yet I'm 100% sure that 99.9% of messages are acknowledged before the 600-second deadline.
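To double-check the deadline theory, the receipt-to-ack latency can be logged per message: the client's Message object carries received (the local receipt time in epoch milliseconds) and, because the subscription has a dead-letter policy, a populated deliveryAttempt. A sketch of a wrapped ack (ackWithTiming is my name, not part of the library):

import { Message } from '@google-cloud/pubsub'

const ACK_DEADLINE_MS = 600 * 1000

// Wrap ack() to record how long each message was held locally before
// being acknowledged; anything near or past the deadline is suspect.
const ackWithTiming = (message: Message) => {
  const heldMs = Date.now() - message.received
  if (heldMs > ACK_DEADLINE_MS) {
    console.warn(
      `late ack: id=${message.id} attempt=${message.deliveryAttempt} held=${heldMs}ms`
    )
  }
  message.ack()
}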