
I have the following setup:

  • 3 Kafka (v2.1.1) brokers
  • 5 Zookeeper instances

Kafka brokers have the following configuration:

      auto.create.topics.enable: 'false'
      default.replication.factor: 1
      delete.topic.enable: 'false'
      log.cleaner.threads: 1
      log.message.format.version: '2.1'
      log.retention.hours: 168
      num.partitions: 1
      offsets.topic.replication.factor: 1
      transaction.state.log.min.isr: '2'
      transaction.state.log.replication.factor: '3'
      zookeeper.connection.timeout.ms: 10000
      zookeeper.session.timeout.ms: 10000
      min.insync.replicas: '2'
      request.timeout.ms: 30000

The producer configuration (using Spring Kafka) is more or less as follows:

...
acks: all
retries: Integer.MAX_VALUE
delivery.timeout.ms: 360000
enable.idempotence: true
...

I read this configuration as follows: there are three Kafka brokers, but once one of them dies, it is fine as long as at least two replicas (= in-sync replicas) persist the data before the ack is sent back. In case of failure, the producer keeps retrying for 6 minutes and then gives up.
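For context, here is a minimal sketch of how this producer configuration might be wired up with Spring Kafka. The bean layout, bootstrap servers and serializers below are assumptions for illustration, not taken from the actual project:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.DefaultKafkaProducerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.core.ProducerFactory;

    @Configuration
    public class ProducerConfiguration {

        @Bean
        public ProducerFactory<String, String> producerFactory() {
            Map<String, Object> props = new HashMap<>();
            // Broker list and serializers are assumptions; adjust to your environment.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // Properties from the question: wait for all in-sync replicas, retry "forever",
            // give up after 6 minutes in total, keep idempotence on.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 360_000);
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            return new DefaultKafkaProducerFactory<>(props);
        }

        @Bean
        public KafkaTemplate<String, String> kafkaTemplate() {
            return new KafkaTemplate<>(producerFactory());
        }
    }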

This is the scenario which causes me headache:

  • All Kafka and Zookeeper instances are up and alive
  • I start sending messages in chunks (500 pcs each)
  • In the middle of the processing, one of the Brokers dies (hard kill)
  • Immediately, I see logs like 2019-08-09 13:06:39.805 WARN 1 --- [b6b45bb5c-7dxh7] o.a.k.c.NetworkClient : [Producer clientId=bla-6b6b45bb5c-7dxh7, transactionalId=bla-6b6b45bb5c-7dxh70] 4 partitions have leader brokers without a matching listener, including [...] (question 1: I do not see any further messages coming in; does this really mean the whole cluster is now stuck, waiting for the dead broker to come back?)
  • After the dead broker starts to boot up again, it begins recovering its corrupted index. This operation takes more than 10 minutes, as I have a lot of data on the Kafka cluster
  • Every 30 s, the producer tries to send the message again (because the request.timeout.ms property is set to 30 s)
  • Since my delivery.timeout.ms is set to 6 minutes and the broker needs 10 minutes to recover and does not persist the data until then, the producer gives up and stops retrying = I potentially lose the data (see the callback sketch right after this list)
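For completeness, this is roughly how a delivery failure could be caught on the producer side instead of being lost silently, assuming a spring-kafka 2.x KafkaTemplate; the topic name and the requeueForLater fallback are hypothetical:

    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.support.SendResult;
    import org.springframework.util.concurrent.ListenableFuture;
    import org.springframework.util.concurrent.ListenableFutureCallback;

    public class ChunkSender {

        private final KafkaTemplate<String, String> kafkaTemplate;

        public ChunkSender(KafkaTemplate<String, String> kafkaTemplate) {
            this.kafkaTemplate = kafkaTemplate;
        }

        public void send(String topic, String key, String payload) {
            ListenableFuture<SendResult<String, String>> future = kafkaTemplate.send(topic, key, payload);
            future.addCallback(new ListenableFutureCallback<SendResult<String, String>>() {
                @Override
                public void onSuccess(SendResult<String, String> result) {
                    // Acknowledged by min.insync.replicas brokers.
                }

                @Override
                public void onFailure(Throwable ex) {
                    // delivery.timeout.ms (6 minutes here) expired before the cluster recovered,
                    // so the record is not persisted in Kafka. Keep it somewhere safe instead of
                    // dropping it; the fallback below is a placeholder for your own logic.
                    requeueForLater(topic, key, payload);
                }
            });
        }

        private void requeueForLater(String topic, String key, String payload) {
            // Hypothetical fallback, e.g. write to a local store or a retry queue.
        }
    }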

The questions are:

  • Why does the Kafka cluster wait until the dead broker comes back?
  • When the producer realizes the broker does not respond, why does it not try to connect to another broker?
  • The thread is completely stuck for 6 minutes, waiting for the dead broker to recover. How can I tell the producer to try another broker instead?
  • Am I missing something, or is there a good practice to avoid such a scenario?
Martin Linha

1 Answer


You have a number of questions; I'll take a shot at sharing our experience, which will hopefully shed light on some of them.

In my product, IBM IDR Replication, we had to provide robustness guidance to customers whose topics were being rebalanced or who had lost a broker in their clusters. The result of some of our testing was that simply setting the request timeout was not sufficient, because in certain circumstances the request would not wait the entire time and would instead retry almost instantly. This burned through the configured number of retries, i.e. there are circumstances where the timeout period is circumvented.

As such, we instructed users to apply a formula like the following:

https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/robust.html

"To tune the values for your environment, adjust the Kafka producer properties retry.backoff.ms and retries according to the following formula: retry.backoff.ms * retries > the anticipated maximum time for leader change metadata to propagate in the clusterCopy For example, you might wish to configure retry.backoff.ms=300, retries=150 and max.in.flight.requests.per.connection=1."

So maybe try tuning retries and retry.backoff.ms. Note that using retries without idempotence can cause batches to be written out of order if you have more than one request in flight, so choose accordingly based on your business logic.
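As a concrete sketch, the example values from that formula could be applied to the producer configuration like this (the numbers are just the documented example; the right values depend on how long leader-change metadata takes to propagate in your cluster):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.ProducerConfig;

    public class RobustRetrySettings {

        // retry.backoff.ms * retries should exceed the worst-case time for new-leader
        // metadata to propagate: 300 ms * 150 retries = 45 s in the documented example.
        public static Properties apply(Properties producerProps) {
            producerProps.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 300);
            producerProps.put(ProducerConfig.RETRIES_CONFIG, 150);
            producerProps.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
            return producerProps;
        }
    }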

In our experience, the Kafka producer writes to the broker that is the leader for the partition, so you have to wait for a new leader to be elected. When it is, and if the retry process is still ongoing, the producer transparently discovers the new leader and writes the data accordingly.
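If you want to watch that leader change happen while the retries are running, an AdminClient sketch like the following (topic name and bootstrap servers are assumptions) prints the current leader of each partition:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class LeaderCheck {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription description = admin
                        .describeTopics(Collections.singleton("my-topic"))
                        .all()
                        .get()
                        .get("my-topic");

                // Each partition reports its current leader; after a broker is killed,
                // this shows when a new leader has been elected for its partitions.
                description.partitions().forEach(p ->
                        System.out.printf("partition %d leader %s%n", p.partition(), p.leader()));
            }
        }
    }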

Shawn