Even if you use the idempotent producer, you still need to make the consumer idempotent, so what is the use of the idempotent producer in Kafka? Is it just to reduce duplication on the broker and save storage?
3 Answers
There are transactional markers added to the record batches, so no, it's not really saving storage. You could always set max.in.flight.requests.per.connection=1 and disable producer retries to "prevent duplicates" instead.
The use-case is primarily to allow exactly once processing. If you don't need it, feel free to disable it...
BTW, recent versions of the Kafka clients enable the idempotent producer by default, so you don't really need to worry about it.

The idempotent producer ensures records are delivered exactly once and in order per partition for the lifetime of the producer instance.
It means that even if there are retries due to a flaky network or any other error, there will be no duplicates within each partition and the order of records will be preserved.
Records produced using an idempotent producer have the exact same size on disk as records from a regular, non-idempotent producer.
So overall the idempotent producer provides better delivery semantics than a regular producer while having no negative impact on performance or disk utilization. This is why since Kafka 3.0, idempotency is enabled by default.
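To make the "no duplicates, order preserved" guarantee concrete, here is a toy model of how a broker could use a per-producer sequence number to drop retried writes and reject gaps. This is only a sketch: `PartitionLog` and its return values are hypothetical names, and the real broker's bookkeeping is more involved than a single counter per producer.

```python
class PartitionLog:
    """Toy model of one partition's log with per-producer sequence tracking."""

    def __init__(self):
        self.records = []   # appended payloads; the list index plays the role of the offset
        self.next_seq = {}  # producer_id -> next expected sequence number

    def append(self, producer_id, seq, payload):
        expected = self.next_seq.get(producer_id, 0)
        if seq < expected:
            # Already written: a retry of a write whose acknowledgement was lost.
            return "duplicate"
        if seq > expected:
            # A gap means an earlier write is missing; reject to preserve order.
            return "out_of_order"
        self.records.append(payload)
        self.next_seq[producer_id] = expected + 1
        return "appended"


log = PartitionLog()
results = [
    log.append("p1", 0, "a"),
    log.append("p1", 0, "a"),  # retry after a lost acknowledgement
    log.append("p1", 2, "c"),  # arrives before sequence 1
    log.append("p1", 1, "b"),
    log.append("p1", 2, "c"),  # retried once the gap is filled
]
print(results)
print(log.records)
```

In this model the retried write and the early write never reach the log, so it ends up holding `a`, `b`, `c` exactly once and in the order the producer sent them.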

You're correct that consumers should still be idempotent when using the idempotent producer, but the idempotent producer does solve a real problem: message reordering and duplication in the log.
A consumer processing a "duplicate" message falls into one of two cases:
- Two messages with different offsets contain the same data (i.e. the producer called produce() once, but the message was written to the log twice)
- The same offset is processed twice
The idempotent producer avoids case 1. At a high level, it prevents messages from a single producer from being appended to the log multiple times or out of order. With idempotence enabled, consumers only have to worry about case 2.
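Case 2 can then be handled on the consumer side by remembering the highest offset already processed per partition. A minimal sketch, assuming offsets arrive in increasing order within a partition; `OffsetDedupConsumer` is a hypothetical name, and a real consumer would need to persist this state together with the side effects of processing:

```python
class OffsetDedupConsumer:
    """Skips any (topic, partition, offset) that was already processed."""

    def __init__(self):
        self.last_processed = {}  # (topic, partition) -> highest offset handled so far
        self.processed = []       # stand-in for the real side effects

    def handle(self, topic, partition, offset, value):
        key = (topic, partition)
        if offset <= self.last_processed.get(key, -1):
            return False  # redelivery of an offset we already handled
        self.processed.append(value)
        self.last_processed[key] = offset
        return True


consumer = OffsetDedupConsumer()
deliveries = [(0, "a"), (1, "b"), (1, "b"), (2, "c")]  # offset 1 is delivered twice
outcomes = [consumer.handle("orders", 0, off, val) for off, val in deliveries]
print(outcomes)
print(consumer.processed)
```

This only works as a deduplication key because the log itself contains no duplicates; if the same data could appear at two different offsets (case 1), the offset check would not catch it.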
Before the idempotent producer, the only way to avoid messages getting reordered in the log (typically due to request retries) was to set max.in.flight.requests.per.connection=1 in the producer's config, which could severely limit throughput; avoiding duplicates additionally meant disabling retries. The idempotent producer allows for up to 5 in-flight requests from the producer.
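The two setups can be written out as producer properties. The keys below are the standard Kafka producer property names; the checker function `idempotence_config_problems` is a hypothetical helper that mirrors the constraints the idempotent producer places on `acks`, `retries`, and `max.in.flight.requests.per.connection`, and the fallback values it uses for missing keys are assumed to match the Kafka 3.0+ client defaults.

```python
# Older workaround: one request in flight at a time, idempotence off.
single_in_flight = {
    "enable.idempotence": "false",
    "max.in.flight.requests.per.connection": "1",
}

# Idempotent producer: retries stay on and up to 5 requests may be in flight.
idempotent = {
    "enable.idempotence": "true",
    "acks": "all",
    "retries": "2147483647",
    "max.in.flight.requests.per.connection": "5",
}


def idempotence_config_problems(config):
    """Return the settings that conflict with enable.idempotence=true."""
    problems = []
    if config.get("enable.idempotence", "true") != "true":
        return problems  # nothing to check when idempotence is off
    if config.get("acks", "all") not in ("all", "-1"):
        problems.append("acks must be 'all'")
    if int(config.get("retries", "2147483647")) <= 0:
        problems.append("retries must be greater than 0")
    if int(config.get("max.in.flight.requests.per.connection", "5")) > 5:
        problems.append("max.in.flight.requests.per.connection must be at most 5")
    return problems


conflicting = {
    "enable.idempotence": "true",
    "acks": "1",
    "max.in.flight.requests.per.connection": "10",
}
print(idempotence_config_problems(idempotent))
print(idempotence_config_problems(conflicting))
```

The first call reports no problems, while the second flags both `acks` and the in-flight limit.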

- Thanks for the answer, but what is the purpose of avoiding duplication in the log? Whether or not it contains duplicates, the consumer still does the same thing: it uses a unique key to make its processing idempotent. – haipeng zou Oct 19 '22 at 03:17
- Consumers could use a unique key of some sort to achieve idempotent processing, but without duplicates in the log, the offset within a partition can now be treated as a unique message identifier, too. But that's only half of the benefit: having messages appended to the log in the order the producer sent them (regardless of retries) is often a requirement, and the idempotent producer provides that guarantee with little overhead (as Mickael detailed). – Chris Beard Oct 19 '22 at 13:30