
I have a unique id in my data and I am sending it to Kafka with the kafka-python library. When I send the same data to the Kafka topic, it consumes the same data anyway. Is there a way to make Kafka skip previously seen messages and continue from new messages?

from kafka import KafkaConsumer

def consume_from_kafka():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["localhost"],
        group_id='my-group')

newUser
  • Reading [this](https://stackoverflow.com/questions/51799077/kafka-python-consumer-start-reading-from-offset-automatically), you seem to be missing the `auto_commit` flag. – Georgi Goranov Apr 01 '22 at 11:12
  • @GeorgiGoranov I am making an example. For example, I have the data {id:1,name:"ok"}, {id:2,name:"null"}, {id:3,name:"zero"}. If I send it to Kafka, it reads and writes; that is fine. But when I run it again, it sends the same messages to the db again. – newUser Apr 01 '22 at 11:37
  • Like he said, you're not committing any consumed offsets, so the consumer will restart at the previous position – OneCricketeer Apr 01 '22 at 13:19
  • @OneCricketeer I am not talking about committing or not. I know that if you commit offsets, the consumer does not consume the same data when you start it again. If you send a message to Kafka with the producer, Kafka still consumes the same data; that is normal. But if you send the same data again, how will Kafka know it is receiving the same data? You are answering the question of why I get the same data when I rerun the consumer, but that is not what I am asking. – newUser Apr 01 '22 at 13:34
  • Kafka producers have no idea you're sending the broker duplicate data. It doesn't care. You'll need to implement this on your own, outside of Kafka APIs – OneCricketeer Apr 01 '22 at 13:35
  • Yes mate, I know that, but what I am asking in the question is whether there is a way to prevent duplicate messages at the consumer or topic level, even if I send the same data to the Kafka topic, for example with id checking or something? @OneCricketeer – newUser Apr 01 '22 at 13:37
  • Like I said, Kafka doesn't care. You must consume all events from the topic, from the last committed offset. **Then** you can parse out the ID from the record and _query the database_ to see if they should be processed or not – OneCricketeer Apr 01 '22 at 13:38
  • ok, then Kafka is still the same as it was 2 years ago – newUser Apr 01 '22 at 13:43
  • That's correct. It's a log, not an indexed database. If you write the same data to a log file, it's still the reader (or writer) responsibility to track what has been written so far in order to detect duplicates. More specifically, the broker doesn't inspect the data you send, and the offsets are always uniquely increasing – OneCricketeer Apr 01 '22 at 13:50
  • Yes mate @OneCricketeer, I built something like that before: holding the unique id (or the last unique id) in a table and checking against it. But after a while, sending data to the topic slows down because of the id check. – newUser Apr 01 '22 at 14:32
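To illustrate the approach described in these comments, here is a minimal sketch that consumes from the last committed offset, parses the id out of each record, and lets the database decide whether it is a duplicate. The sqlite database and `processed` table are illustrative assumptions, not something from the thread; `INSERT OR IGNORE` both records and checks the id in a single statement, so no separate lookup query is needed:

import json
import sqlite3
from kafka import KafkaConsumer

# illustrative dedup store: any database with a unique constraint on the id works
db = sqlite3.connect('dedup.db')
db.execute('CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)')

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=['localhost'],
    group_id='my-group',
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))

for message in consumer:
    record = message.value
    # INSERT OR IGNORE writes nothing when the id already exists,
    # so rowcount == 0 marks the record as a duplicate
    cur = db.execute('INSERT OR IGNORE INTO processed (id) VALUES (?)',
                     (str(record['id']),))
    db.commit()
    if cur.rowcount == 0:
        continue  # already processed, skip it
    # ... write the actual record to the real database here ...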

2 Answers


Ok, I finally got your question. Avoiding reprocessing a message that has (accidentally) been sent multiple times by a producer can be quite complicated.

There are generally 2 cases:

  • The simple one, where you have a single instance that consumes the messages. In that case your producer can add a uuid to the message payload and your consumer can keep the ids of processed messages in an in-memory cache (see the sketch after this list).

  • The complicated one, where you have multiple instances that consume messages (which is usually why you need a message broker in the first place - a distributed system). In this scenario you need an external service to play the role of a distributed cache; Redis is a good choice. Alternatively, you can use a relational database (which you probably already have in your stack) and record the processed message ids there.
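A minimal sketch of the single-instance case (an illustration, not from the original answer), assuming the producer puts a unique "id" field in a JSON payload; the topic and server names are placeholders:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',                      # placeholder topic name
    bootstrap_servers=['localhost'],
    group_id='my-group',
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))

seen_ids = set()  # in-memory cache of processed ids, lost on restart

for message in consumer:
    record = message.value
    if record['id'] in seen_ids:
        continue  # duplicate payload, skip it
    seen_ids.add(record['id'])
    # ... process the record, e.g. write it to the database ...

For the multi-instance case the same check can live in Redis instead of a local set: `SADD processed_ids <id>` returns 0 when the id is already a member, so the message can be skipped.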

Hope that helps.

  • Even if you use a memory cache, that's going to be lost if/when the consumer restarts. You'd need a persistent store, regardless of number of instances to truly prevent all duplicates – OneCricketeer Apr 05 '22 at 13:56

Someone might need this here. I solved the duplicate message problem using the code below; I am using the kafka-python library.

from kafka import KafkaConsumer

# KAFKA holds the bootstrap server address, e.g. 'localhost:9092'
consumer = KafkaConsumer('TOPIC', bootstrap_servers=KAFKA,
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000, group_id='my-group')
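With `enable_auto_commit=True` the group's offsets are committed in the background (every second here, given `auto_commit_interval_ms=1000`), so a restarted consumer with the same `group_id` resumes after the last committed offset instead of re-reading old messages. A short usage sketch (not part of the original answer), where `process_record` is a hypothetical handler:

for message in consumer:
    # offsets for this group are committed automatically in the background,
    # so a restarted consumer continues from roughly this position
    process_record(message.value)  # hypothetical processing function

Note that committed offsets only keep the consumer from re-reading messages it has already seen; if the same payload is produced to the topic twice, it is still delivered twice, which is the case the other answer addresses.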
yTek