
I'm currently learning more about the Kafka Producer. I am a bit puzzled by the following paragraph from the docs:

Messages written to the partition leader are not immediately readable by consumers regardless of the producer’s acknowledgement settings. When all in-sync replicas have acknowledged the write, then the message is considered committed, which makes it available for reading. This ensures that messages cannot be lost by a broker failure after they have already been read. Note that this implies that messages which were acknowledged by the leader only (that is, acks=1) can be lost if the partition leader fails before the replicas have copied the message. Nevertheless, this is often a reasonable compromise in practice to ensure durability in most cases while not impacting throughput too significantly.

The way I interpret this is that messages can get lost during the synchronisation between the leader and the replica brokers, i.e. a message won't be considered committed unless it has been successfully replicated.

I don't understand how (for example) a Java application can guard against this message loss. Does it receive different acknowledgements between 'only-leader' and the full replication?

this is often a reasonable compromise in practice

How is that? Do they assume that you should log failed messages and re-queue them manually? Or how does that work?


1 Answer


"Does it receive different acknowledgements between 'only-leader' and the full replication?"

There is no different acknowledgement sent for 'leader only' versus full replication. You steer the behavior of the producer solely through its configuration acks: if it is set to 1, the producer waits only for the leader's acknowledgment; if you set it to all, it waits for all in-sync replicas (based on the replication factor of the topic) before it considers the write successful.
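
As a minimal sketch of what this looks like in a Java producer (the broker address and topic name here are made-up placeholders, not taken from the question):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "1"   -> wait only for the partition leader's acknowledgment
        // "all" -> wait until all in-sync replicas have the message
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")); // "my-topic" is a placeholder
        }
    }
}
```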

If you set acks=all and the synchronisation between leader and replicas fails, your producer will receive a retriable exception (either NotEnoughReplicasException or NotEnoughReplicasAfterAppendException; see more details here). Based on the producer configuration retries, it will then try to re-send the message. Kafka is built on the expectation that crashed brokers become available again within a "short" amount of time.
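
To illustrate (a minimal sketch, not code from the answer; broker address, topic name, and the retry count are assumed placeholders): the producer retries transparently, and the send callback only sees an exception once the configured retries are exhausted.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class RetriesExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retriable exceptions such as NotEnoughReplicasException are retried
        // this many times before the send is reported as failed.
        props.put(ProducerConfig.RETRIES_CONFIG, 5);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // Only reached after all retries have been exhausted.
                    System.err.println("Send failed after retries: " + exception);
                } else {
                    System.out.println("Committed at offset " + metadata.offset());
                }
            });
        }
    }
}
```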

In case you have set acks=1 and the synchronisation between leader and replicas fails, your producer considers the message to have been successfully written to the cluster and will not try to send it again. Of course the leader will continue trying to replicate the message to its replicas, but there is no guarantee that this will succeed. Before the message is replicated, the leader broker itself could fail, which would cause the message to be lost forever.

  • So, from your explanation I understand that there is no guaranteed delivery unless you set acks to 'all'. I don't see how not setting it to 'all' is an acceptable tradeoff if you don't want to lose messages (?). It's like a database that acknowledges an insert and then loses it, which doesn't seem acceptable to me at all. Sounds just weird to me tbh. Do you have some insights into the overhead of waiting for all replicas to acknowledge? I want to use the sink connector for piping data from a REST API into MongoDB. – html_programmer Jan 18 '21 at 12:32
  • Note: I can see how this is acceptable for big data; however, my use case is to catch spikes in users uploading data during certain times. In this case data should not get lost, as it concerns user data. Seems I need 'acks' set to 'all', but only if I can still get the benefit of a fast ack back to the client. If that makes sense to you. Seems like a common use case. – html_programmer Jan 18 '21 at 12:42
  • I understand your viewpoint on this, and your understanding is correct: you are losing some degree of delivery guarantee when setting acks=1 or even 0 (which would be the "fire-and-forget" mode). In a lot of big data use cases using distributed systems, it is often not a big issue if you lose one or the other message. If you are only interested in the data of, say, the last few minutes, the few lost messages will have no impact after those few minutes. – Michael Heil Jan 18 '21 at 13:05
  • Although in one of my last use cases it was mandatory not to lose any message, I still set acks=1 because I had to process a lot of data at once. Depending on your cluster setup, network bandwidth, traffic, server location and other factors, you will get a real performance benefit when reducing acks from all to 1. In our case it was a factor of around 5, with clusters about 50 km apart from each other. But of course, I cannot advise on the potential performance benefit in your case without knowing all the mentioned details... I guess you just need to test it. – Michael Heil Jan 18 '21 at 13:09
  • By the way, you could also use "transactions" in Kafka to ensure that you are not losing any data (see the sketch below). – Michael Heil Jan 18 '21 at 13:10
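
To illustrate the transactions hint from the last comment, here is a minimal sketch of a transactional producer (the transactional.id, broker address, and topic name are made-up placeholders; a real application would also need proper handling of fatal errors):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Setting a transactional.id enables idempotence and implies acks=all.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id"); // placeholder

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("my-topic", "key", "value")); // placeholder topic
                producer.commitTransaction();
            } catch (KafkaException e) {
                // For non-fatal errors the transaction can be aborted and retried;
                // fatal errors (e.g. ProducerFencedException) require closing the producer instead.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```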