Kafka producer fails to send messages with NOT_LEADER_FOR_PARTITION exception

Question

We're using spring-cloud-stream-binder-kafka (3.0.3.RELEASE) to send messages to our Kafka cluster (2.4.1). Every now and then one of the producer threads receives NOT_LEADER_FOR_PARTITION exceptions and even exhausts its retries (currently set to 12, activated by the spring-retry dependency). We've restricted the retries because we're sending about 1k msg/s (per producer instance) and were worried about the size of the buffer. As a result we're regularly losing messages, which is bad for downstream consumers, because we can't simply reproduce the incoming traffic.

The error message is:

[Producer clientId=producer-5] Received invalid metadata error in produce request on partition topic-21 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
[Producer clientId=producer-5] Got error produce response with correlation id 974706 on topic-partition topic-21, retrying (8 attempts left). Error: NOT_LEADER_FOR_PARTITION
[Producer clientId=producer-5] Got error produce response with correlation id 974707 on topic-partition topic-21, retrying (1 attempts left). Error: NOT_LEADER_FOR_PARTITION

Any known way to avoid this? Should we go back to the default of MAX_INT retries? Why does it keep sending to the same broker, even though it responded with NOT_LEADER_FOR_PARTITION?

Any hints are welcome.

EDIT: We just noticed that the broker metric kafka_network_requestmetrics_responsequeuetimems goes up around that time, but the max we've seen is around 2.5s

Can you update what you found out? — Usul, Nov 15 '21 at 10:24

Rohit Yadav · Answer 1 · 2020-05-15T16:13:52.067

Both Produce and Fetch requests are sent to the leader replica of the partition. A NotLeaderForPartitionException is thrown when a request is sent to a broker that is no longer the leader replica for that partition.

The client maintains the information regarding the leader of each partition as a cache.

The client refreshes this information periodically, and the interval is controlled by metadata.max.age.ms in the producer configuration. The default value of this setting is 300000 ms (5 minutes).
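
As a rough illustration of the caching described above, here is a toy Java model. It is not Kafka's actual code: the class `LeaderCacheSketch`, its methods and the `lookup` function are made up for this sketch. It assumes a cached leader is reused until the cache is older than the maximum age or an error such as NOT_LEADER_FOR_PARTITION flags it for refresh.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

/** Toy model of a client-side partition-leader cache; NOT Kafka's implementation. */
public class LeaderCacheSketch {
    private final long maxAgeMs;             // plays the role of metadata.max.age.ms
    private final IntUnaryOperator lookup;   // hypothetical: partition -> current leader broker id
    private final Map<Integer, Integer> leaders = new HashMap<>();
    private long lastRefreshMs = 0;
    private boolean updateRequested = true;  // forces a lookup on first use

    public LeaderCacheSketch(long maxAgeMs, IntUnaryOperator lookup) {
        this.maxAgeMs = maxAgeMs;
        this.lookup = lookup;
    }

    /** Called when a broker answers "not the leader": the cached view is stale. */
    public void onNotLeaderError() {
        updateRequested = true;
    }

    /** Returns the cached leader, dropping the cache first if it is flagged or too old. */
    public int leaderFor(int partition, long nowMs) {
        if (updateRequested || nowMs - lastRefreshMs >= maxAgeMs) {
            leaders.clear();
            lastRefreshMs = nowMs;
            updateRequested = false;
        }
        return leaders.computeIfAbsent(partition, p -> lookup.applyAsInt(p));
    }

    public static void main(String[] args) {
        int[] actualLeader = {1};  // stands in for the cluster state
        LeaderCacheSketch cache = new LeaderCacheSketch(300_000, p -> actualLeader[0]);

        System.out.println(cache.leaderFor(21, 0));      // 1 (fresh lookup)
        actualLeader[0] = 2;                              // leadership moves to broker 2
        System.out.println(cache.leaderFor(21, 1_000));  // 1 (stale cache entry)
        cache.onNotLeaderError();                         // broker 1 rejected the request
        System.out.println(cache.leaderFor(21, 1_100));  // 2 (refreshed)
    }
}
```

In this model a stale entry keeps pointing at the old broker until either the age limit passes or an error forces a refresh, which is why lowering the age limit alone may not change what happens right after an error.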

You can go through the following Apache Kafka documentation.

https://kafka.apache.org/documentation/

Please go through the Sender.java code.

https://github.com/a0x8o/kafka/blob/master/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java

You will find both error messages in the Sender code. The default value of metadata.max.age.ms is 300000 ms (5 minutes). I think you should reduce this value and then observe the behavior.
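
To make the settings concrete, here is a minimal Java sketch using plain `java.util.Properties` with the standard producer config keys. It assumes the 12 retries from the question are the producer's own `retries` setting. The values are illustrative assumptions rather than recommendations (100 ms is the client default for `retry.backoff.ms`, 30000 ms for `metadata.max.age.ms` is just an example of a reduced value), and the class and method names are hypothetical.

```java
import java.util.Properties;

public class ProducerRetrySketch {

    /** Lower bound on time spent backing off, assuming a fixed pause between attempts. */
    public static long minRetryWindowMs(int retries, long retryBackoffMs) {
        return (long) retries * retryBackoffMs;
    }

    /** Illustrative producer settings; the values are assumptions for this example. */
    public static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("retries", "12");                  // value from the question
        p.setProperty("retry.backoff.ms", "100");        // pause between retry attempts
        p.setProperty("delivery.timeout.ms", "120000");  // upper bound on a send, retries included
        p.setProperty("metadata.max.age.ms", "30000");   // example of a reduced refresh interval
        return p;
    }

    public static void main(String[] args) {
        Properties p = producerProps();
        long window = minRetryWindowMs(
                Integer.parseInt(p.getProperty("retries")),
                Long.parseLong(p.getProperty("retry.backoff.ms")));
        System.out.println("backoff window >= " + window + " ms");  // backoff window >= 1200 ms
    }
}
```

With 12 retries and a 100 ms pause, the backoff alone covers only about 1.2 s, so a leader change that takes longer than the retries last could exhaust them. Because `delivery.timeout.ms` caps the total time a send may take, retries included, it may be a more direct way to bound buffering than a small retry count.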

Thanks for the hint about metadata.max.age.ms, but the log message that displays the error also says "Going to request metadata update now", which implies that it's not waiting for the interval to end, but requests new metadata right now. The retries seem to be unaffected by that unfortunately... — smlgbl, May 15 '20 at 07:20
Added 2 more preceding lines from the log. It seems that an error during the retries doesn't even trigger a metadata update, but only once the retries are exceeded and the exception comes back up... — smlgbl, May 15 '20 at 12:25

score 1 · Answer 2 · answered Sep 12 '22 at 13:18

You need to configure the listeners properly.

I'm using docker-compose like this:

services:
  zookeeper:
    container_name: zookeeper
    ports:
      - "2181:2181"
    ...
  broker-1:
    hostname: "broker-1.mydomain.com"
    ports:
      - "29091:29091"
    ...
  broker-2:
    hostname: "broker-2.mydomain.com"
    container_name: broker-2
    ports:
      - "29092:29092"

Edit server.properties for each broker:

broker-1

listeners: PRIVATE_HOSTNAME://broker-1.mydomain.com:9092,PUBLIC_HOSTNAME://broker-1.mydomain.com:29091
advertised.listeners: PRIVATE_HOSTNAME://broker-1.mydomain.com:9092,PUBLIC_HOSTNAME://broker-1.mydomain.com:29091
listener.security.protocol.map: PUBLIC_HOSTNAME:PLAINTEXT,PRIVATE_HOSTNAME:PLAINTEXT
inter.broker.listener.name: PRIVATE_HOSTNAME

broker-2

listeners: PRIVATE_HOSTNAME://broker-2.mydomain.com:9092, PUBLIC_HOSTNAME://broker-2.mydomain.com:29092
advertised.listeners: PRIVATE_HOSTNAME://broker-2.mydomain.com:9092, PUBLIC_HOSTNAME://broker-2.mydomain.com:29092
listener.security.protocol.map: PUBLIC_HOSTNAME:PLAINTEXT, PRIVATE_HOSTNAME:PLAINTEXT
inter.broker.listener.name: PRIVATE_HOSTNAME

IMPORTANT: note that I'm using the same hostname for the private and public networks, because consumers and producers can only reach Kafka through the registered name, like this:

    [BrokerToControllerChannelManager broker=1 name=forwarding]: Recorded new controller, from now on will use broker broker-1.mydomain.com:9092
...
    [BrokerToControllerChannelManager broker=2 name=forwarding]: Recorded new controller, from now on will use broker broker-2.mydomain.com:9092

Edit /etc/hosts on your host machine:

127.0.0.1   broker-1.mydomain.com
127.0.0.1   broker-2.mydomain.com
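
A broker hands clients the address advertised for the listener they connected on, so it can help to list what each listener advertises and compare it with the port mappings and /etc/hosts entries above. The following Java sketch only splits an advertised.listeners value into its parts. The class and method names are hypothetical, and it does no network checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AdvertisedListenersSketch {

    /** Splits "NAME://host:port,NAME2://host:port" into listener name -> "host:port". */
    public static Map<String, String> parse(String advertisedListeners) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String entry : advertisedListeners.split(",")) {
            String[] parts = entry.trim().split("://", 2);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                throw new IllegalArgumentException("Malformed listener entry: " + entry);
            }
            byName.put(parts[0], parts[1]);
        }
        return byName;
    }

    public static void main(String[] args) {
        // Value copied from the broker-2 configuration above.
        String value = "PRIVATE_HOSTNAME://broker-2.mydomain.com:9092, "
                + "PUBLIC_HOSTNAME://broker-2.mydomain.com:29092";
        parse(value).forEach((name, address) ->
                System.out.println(name + " -> " + address));
    }
}
```

For a client outside the Docker network, the PUBLIC_HOSTNAME address is the one that has to resolve and be reachable from the client machine.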

score 0 · Answer 3 · answered Mar 22 '23 at 22:05

My solution (on IOs) was to first stop ZooKeeper, the Kafka servers and any clients, so that Kafka is quiet. Then delete the state directories (this wipes the ZooKeeper state and all Kafka log data stored there):

cd /tmp
rm -rf zookeeper kafka-logs

Then restart ZooKeeper and then Kafka.

I would expect it to be the same on Linux. On Windows you'll have to find the directories where the kafka-logs and ZooKeeper state files are stored.

Richard Keene

398
3
14

BE CAREFUL WITH THAT rm -rf command. It is very powerful. — Richard Keene, Mar 22 '23 at 22:05
