5

I don't see any failures while producing or consuming the data, yet there are a bunch of duplicate messages in production. For a small topic that gets around 100k messages, there are ~4k duplicates. As I said, there is no failure, and on top of that no retry logic is implemented and no retry config value is set.

I also checked the offset values for those duplicate messages; each has a distinct value, which tells me the issue is on the producer side.

Any help would be highly appreciated

East2West
  • 657
  • 1
  • 6
  • 22

1 Answer

6

Read more about message delivery in Kafka:

https://kafka.apache.org/08/design.html#semantics

So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
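As a sketch of the at-most-once option mentioned above: on the producer side it amounts to turning retries off (the property names below are from the newer Java client; in the 0.8-era producer described in the question, the equivalent knob was `message.send.max.retries`):

```java
import java.util.Properties;

// Sketch of producer settings for at-most-once delivery.
// Property names assume the newer Java client; values here are
// illustrative, not tuned recommendations.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("retries", "0"); // never retry: a failed send is simply dropped, not duplicated
props.put("acks", "1");    // leader acknowledgement only
```

On the consumer side, at-most-once additionally means committing the offset *before* processing a batch, so a crash mid-batch skips messages rather than reprocessing them.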

Probably you are looking for "exactly-once delivery", as in JMS:

https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactly-oncemessagingfromKafka?

There are two approaches to getting exactly-once semantics during data production:

1. Use a single writer per partition, and every time you get a network error check the last message in that partition to see if your last write succeeded.
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.

We implemented the second approach in our systems.
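A minimal sketch of the consumer-side deduplication from point 2, assuming the producer stamps each message with a UUID. The `Deduplicator` class and its size bound are hypothetical names for illustration; bounding the remembered IDs with an LRU window keeps memory constant under high throughput, at the cost of missing duplicates that arrive further apart than the window:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Consumer-side deduplicator: remembers the last maxEntries message IDs
// (e.g. producer-assigned UUIDs) and rejects IDs it has already seen.
// An access-ordered LinkedHashMap gives LRU eviction for free.
class Deduplicator {
    private final Map<String, Boolean> seen;

    Deduplicator(int maxEntries) {
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict the least-recently-seen ID once the window is full.
                return size() > maxEntries;
            }
        };
    }

    /** Returns true the first time an ID is seen, false for a duplicate. */
    boolean firstTime(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```

In the poll loop you would then call `firstTime(record.key())` (or wherever the UUID is carried) and skip processing when it returns false.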

Anatoly Deyneka
  • 1,238
  • 8
  • 13
  • Thank you Anatoly for the answer. You picked the second solution, but then wouldn't you expect overhead from deduping while consuming the data? We will have larger datasets, like 50k messages per second, and if we go with dedup in the consumer then I will have to maintain a hash for each unique UUID, which I expect to have a big impact on processing. – East2West Dec 08 '15 at 17:41
  • 50k per sec is serious load for consumer. You can test first and second solution for your usecase or wait for future releases. "Apache Kafka community plans to focus on operational simplicity and stronger delivery guarantees. This work includes automated data balancing, more security enhancements, and support for exactly-once delivery in Kafka" – Anatoly Deyneka Dec 09 '15 at 07:41
  • @AnatolyDeyneka : Do you have any idea, how to implement single writer per partition? – Shankar Aug 29 '16 at 17:29
  • 1
    @AnatolyDeyneka - You should totally write a blog about this. I mean explaining how it works theoretically, and then some code sample for others to try as a reference. – JR ibkr Jan 29 '19 at 15:56