
Folks, I'm trying to do a POC for processing messages with Kafka in an implementation that absolutely requires exactly-once processing. Example: as a payment system, process a credit card transaction only once.

What edge cases should we protect against?

One failure scenario covered here is:

1.) If a consumer fails and does not commit that it has read through a particular offset, the messages will be read again.

Let's say the consumers live in Kubernetes pods, and one of the hosts goes offline. We could potentially have messages that have been processed, but not marked as processed (committed) in Kafka, before the pods went away due to an underlying hardware issue. Do I understand this error scenario correctly?
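For reference, here is roughly the consumer loop I have in mind (a minimal sketch only; the broker address, topic name, and the `chargeCard()` call are placeholders for our actual setup):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PaymentConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "payments-poc");               // placeholder group
        props.put("enable.auto.commit", "false");            // commit manually after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    chargeCard(record.value()); // the side effect happens here
                }
                // If the pod dies between chargeCard() above and this commit,
                // the next consumer to own the partition re-reads those records.
                consumer.commitSync();
            }
        }
    }

    private static void chargeCard(String payload) {
        // placeholder for the actual payment call
    }
}
```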

Are there other failure scenarios we need to fully understand on the producer/consumer side when aiming for exactly-once processing with Kafka?

Thanks!

Cmag

1 Answer


I'm going to basically repeat and expand on an answer I gave here:

A few scenarios can result in duplication:

  1. Consumers only periodically checkpoint their positions. A consumer crash can result in duplicate processing of some range of records.
  2. Producers have client-side timeouts. This means the producer may think a request timed out and re-transmit it, while broker-side it actually succeeded (see the sketch after this list).
  3. If you mirror data between Kafka clusters, that's usually done with a producer + consumer pair of some sort, which can lead to more duplication.
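To make scenario 2 concrete, here's a rough sketch of a producer config where a client-side timeout can produce a duplicate (illustration only; the broker address, topic, key and value are made up, and the timeouts are deliberately aggressive):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class RetryingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", "5");                 // client retries on timeouts and transient errors
        props.put("request.timeout.ms", "5000");   // client-side timeout
        props.put("enable.idempotence", "false");  // no broker-side dedup of retried batches
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // If the broker appends this record but the ack is lost or slow, the client
            // times out and retries, and the record ends up in the log twice.
            producer.send(new ProducerRecord<>("payments", "card-123", "charge:10.00"));
            producer.flush();
        }
    }
}
```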

There are also scenarios that end in data loss - look up "unclean leader election" (disabling it is a trade-off against availability).

Also - Kafka "exactly once" configurations only work if all your inputs, outputs, and side effects happen on the same Kafka cluster, which often makes them of limited use in real life.

There are a few Kafka features you could try using to reduce the likelihood of this happening to you (see the sketch after this list):

  1. Set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead.
  2. Use transactions when producing - incurs overhead and adds latency.
  3. Set transactional.id on the producer in case you fail over across machines - gets complicated to manage at scale.
  4. Set isolation.level to read_committed on the consumer - adds latency (needs to be done in combination with 2 above).
  5. Shorten auto.commit.interval.ms on the consumer - this just reduces the window of duplication, it doesn't really solve anything, and it incurs overhead at really low values.
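Here's a rough sketch of how 1-4 fit together in a consume-transform-produce loop (a sketch only, not production code; the broker address, topic names, group.id and transactional.id are placeholders, and it assumes input and output topics live on the same cluster, per the caveat above):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TransactionalPipelineSketch {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("enable.idempotence", "true");               // (1) broker-side dedup of producer retries
        p.put("transactional.id", "payments-processor-0"); // (3) stable id so a restarted instance fences the old one
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "payments-poc");
        c.put("enable.auto.commit", "false");       // offsets are committed inside the transaction instead
        c.put("isolation.level", "read_committed"); // (4) never see records from aborted transactions
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("payments-in"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction(); // (2) outputs and offsets commit or abort together
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        producer.send(new ProducerRecord<>("payments-out", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                    new OffsetAndMetadata(record.offset() + 1));
                    }
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // records get redelivered, but never half-committed
                }
            }
        }
    }
}
```

Note that this keeps the Kafka side exactly-once, but the actual card charge is still a side effect outside the transaction, so you'd still want an idempotency key on that call.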

I have to say that, as someone who's been maintaining a VERY large Kafka installation for the past few years, I'd never use a bank that relied on Kafka for its core transaction processing, though ...

radai