We're using spring-cloud-stream-binder-kafka (3.0.3.RELEASE) to send messages to our Kafka cluster (2.4.1). Every now and then one of the producer threads receives NOT_LEADER_FOR_PARTITION exceptions, and even exceeds the retries (currently set at 12, activated by dependency spring-retry). We've restricted the retries because we're sending about 1k msg/s (per producer instance) and were worried about the size of the buffer. This way we're regularly loosing messages, which is bad for downstream consumers, because we can't simply reproduce the incoming traffic.
The error message is
[Producer clientId=producer-5] Received invalid metadata error in produce request on partition topic-21 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
[Producer clientId=producer-5] Got error produce response with correlation id 974706 on topic-partition topic-21, retrying (8 attempts left). Error: NOT_LEADER_FOR_PARTITION
[Producer clientId=producer-5] Got error produce response with correlation id 974707 on topic-partition topic-21, retrying (1 attempts left). Error: NOT_LEADER_FOR_PARTITION
Any known way to avoid this? Should we go back to the default of MAX_INT retries? Why does it keep sending to the same broker, even though it responded with NOT_LEADER_FOR_PARTITION?
Any hints are welcome.
EDIT: We just noticed that the broker metric kafka_network_requestmetrics_responsequeuetimems goes up around that time, but the max we've seen is around 2.5s