A Linux C++ application writes to Kafka using librdkafka. It has to write a set of messages sequentially (in order, without gaps) to a Kafka topic. There is no transaction requirement, so the transactional producer is NOT used; enable.idempotence=true is used instead.
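For context, a minimal sketch of the producer setup (broker address and error handling are illustrative placeholders):

```cpp
#include <librdkafka/rdkafkacpp.h>

#include <iostream>
#include <memory>
#include <string>

std::unique_ptr<RdKafka::Producer> make_producer() {
  std::string errstr;
  std::unique_ptr<RdKafka::Conf> conf(
      RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL));
  conf->set("bootstrap.servers", "broker1:9092", errstr);
  // Idempotence deduplicates broker-side retries within one producer
  // session, but the producer ID changes on restart, so it does not
  // protect across the recovery procedure described below.
  conf->set("enable.idempotence", "true", errstr);
  std::unique_ptr<RdKafka::Producer> producer(
      RdKafka::Producer::create(conf.get(), errstr));
  if (!producer)
    std::cerr << "failed to create producer: " << errstr << "\n";
  return producer;
}
```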
When a (librdkafka) fatal error is detected: delete the Producer object -> wait 15 seconds -> create a new Consumer object and a new Producer object -> read (using the Consumer object) the last written message in the topic -> resume writing from that message onwards. If any of these steps fails, the whole procedure is retried after 15 seconds. (Fatal-error detection is sketched below.)
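A hedged sketch of how the fatal error is detected: via an event callback registered with conf->set("event_cb", ...), checking for ERR__FATAL, which librdkafka reports when the idempotent producer hits an unrecoverable error. The main loop then polls the flag and runs the recovery procedure above.

```cpp
#include <librdkafka/rdkafkacpp.h>

#include <atomic>

class FatalEventCb : public RdKafka::EventCb {
 public:
  std::atomic<bool> fatal_seen{false};

  void event_cb(RdKafka::Event &event) override {
    if (event.type() == RdKafka::Event::EVENT_ERROR &&
        event.err() == RdKafka::ERR__FATAL) {
      // The producer instance is unusable from here on; the main loop
      // sees this flag and performs delete -> wait 15 s -> recreate.
      fatal_seen = true;
    }
  }
};
```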
On producer (Linux) process restart: create a new Consumer object and a new Producer object -> read (using the Consumer object) the last written message in the topic -> resume writing from that message onwards. If any of these steps fails, the whole procedure is retried after 15 seconds. (The last-message read is sketched below.)
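A minimal sketch of the "read the last written message" step, assuming a single partition (0) and that the internal sequence number can be recovered from the payload of the returned record. Note the high watermark only reflects what the brokers currently expose, which is exactly how the stale seq-100 answer in the scenario below can happen.

```cpp
#include <librdkafka/rdkafkacpp.h>

#include <memory>
#include <string>
#include <vector>

// Returns the last message in partition 0 of `topic`, or nullptr.
std::unique_ptr<RdKafka::Message> read_last_message(
    RdKafka::KafkaConsumer *consumer, const std::string &topic) {
  int64_t low = 0, high = 0;
  if (consumer->query_watermark_offsets(topic, 0, &low, &high, 5000) !=
          RdKafka::ERR_NO_ERROR ||
      high <= low)
    return nullptr;  // query failed or partition is empty

  // Seek to the offset just before the high watermark and read one record.
  std::unique_ptr<RdKafka::TopicPartition> tp(
      RdKafka::TopicPartition::create(topic, 0, high - 1));
  std::vector<RdKafka::TopicPartition *> tps{tp.get()};
  if (consumer->assign(tps) != RdKafka::ERR_NO_ERROR)
    return nullptr;

  std::unique_ptr<RdKafka::Message> msg(consumer->consume(5000));
  if (msg->err() != RdKafka::ERR_NO_ERROR)
    return nullptr;  // e.g. timed out
  return msg;
}
```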
Assume the following scenario. The application needs to write:
|msg with Internal sequence-100||msg with Internal sequence-101||msg with Internal sequence-102|msg with Internal sequence-103|...
- The producer process enqueued seq-100, seq-101, seq-102 and seq-103 sequentially (i.e. produce() was called for each).
- The Kafka cluster is shut down and restarted after about 5 minutes. Meanwhile, the producer application detects a fatal error (due to the delivery timeout) and continuously performs the recovery procedure described above ("When a Kafka fatal error is detected").
- When the last written message is requested (by the Linux process), the seq-100 message is returned, because seq-101 to seq-103 are still in the Kafka cluster's internal queues (the cluster is restarting).
- Since the last written message is seq-100, the producer (Linux) application writes seq-101 -> seq-102 -> seq-103.
- The Kafka cluster first persists the seq-101 to seq-103 messages that were in its internal queues, and afterwards persists the seq-101 to seq-103 messages produced in the previous step.
- This results in messages seq-101 to seq-103 being duplicated in the Kafka topic.
How can I avoid duplicate messages in this case?
One solution is to insert a special message (which Kafka readers should ignore) before reading, and to make sure it is returned as the last message (retrying until it is), so that the internal queues can be assumed flushed. But this requires asking Kafka readers to ignore the "special" messages, which is NOT preferred. (A sketch of this idea follows.)
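Sketch of the sentinel idea, reusing the read_last_message() helper from the earlier sketch. "__sync_marker__" is an arbitrary payload that readers would have to skip, which is the drawback mentioned above.

```cpp
#include <librdkafka/rdkafkacpp.h>

#include <memory>
#include <string>

bool barrier_with_sentinel(RdKafka::Producer *producer,
                           RdKafka::KafkaConsumer *consumer,
                           const std::string &topic) {
  static const std::string kSentinel = "__sync_marker__";
  producer->produce(topic, /*partition=*/0, RdKafka::Producer::RK_MSG_COPY,
                    const_cast<char *>(kSentinel.data()), kSentinel.size(),
                    /*key=*/nullptr, /*key_len=*/0,
                    /*timestamp=*/0, /*msg_opaque=*/nullptr);
  if (producer->flush(15000) != RdKafka::ERR_NO_ERROR)
    return false;  // delivery not confirmed in time; retry later

  // Only when the sentinel comes back as the very last record can we
  // assume the broker-side queues have drained.
  std::unique_ptr<RdKafka::Message> last = read_last_message(consumer, topic);
  return last &&
         std::string(static_cast<const char *>(last->payload()),
                     last->len()) == kSentinel;
}
```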
Can the transactional producer be used to avoid this? If it is a solution, can it support very high message rates, e.g. the 50,000 msgs per second we need? Is commit_transaction (in librdkafka) a blocking call (i.e. is the calling thread blocked until the cluster acknowledges)?
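For reference, a hedged sketch of how the transactional flow would look in librdkafka (requires transactional.id in the producer config, and init_transactions() must have been called once beforehand; batch size and timeouts are illustrative):

```cpp
#include <librdkafka/rdkafkacpp.h>

#include <memory>
#include <string>
#include <vector>

bool produce_batch_transactionally(RdKafka::Producer *producer,
                                   const std::string &topic,
                                   const std::vector<std::string> &msgs) {
  std::unique_ptr<RdKafka::Error> err(producer->begin_transaction());
  if (err) return false;

  for (const auto &m : msgs)
    producer->produce(topic, RdKafka::Topic::PARTITION_UA,
                      RdKafka::Producer::RK_MSG_COPY,
                      const_cast<char *>(m.data()), m.size(),
                      /*key=*/nullptr, /*key_len=*/0,
                      /*timestamp=*/0, /*msg_opaque=*/nullptr);

  // Blocking: waits for outstanding deliveries and the commit itself,
  // up to the given timeout.
  err.reset(producer->commit_transaction(/*timeout_ms=*/10000));
  if (err) {
    std::unique_ptr<RdKafka::Error> aerr(producer->abort_transaction(10000));
    return false;
  }
  return true;
}
```

If transactions are used, the commit would presumably have to cover a batch of messages rather than a single message, so that the per-commit round trips are amortized at such rates.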
Are there any other solutions to avoid the duplication?