Introduction:
I previously came across a similar question (this link), but my case is different: we run Kafka in KRaft mode rather than with ZooKeeper.
Specification:
Kafka version: 3.3.1
Number of brokers: 8
Minimum replication factor of topics: 3
Problem Description:
At the time of writing, I have run into this issue numerous times. The relevant Kafka log is shown below:
[2023-01-09 09:53:03,929] WARN [Controller 3] maybeFenceReplicas: failed with unknown server exception NotLeaderException at epoch 2641 in 1913 us. Renouncing leadership and reverting to the last committed offset 9986340. (org.apache.kafka.controller.QuorumController)
org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
at org.apache.kafka.raft.KafkaRaftClient.lambda$append$27(KafkaRaftClient.java:2262)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.kafka.raft.KafkaRaftClient.append(KafkaRaftClient.java:2261)
at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2257)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:813)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:792)
at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:903)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:791)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with NotControllerException in 415741179 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with NotControllerException in 206629449 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 206629220 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 206626538 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 205746648 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 7549 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 6986 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 6399 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 5912 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,932] ERROR [Controller 3] Unexpected exception while executing deferred write event maybeFenceReplicas. Rescheduling for a minute from now. (org.apache.kafka.controller.QuorumController)
org.apache.kafka.common.errors.UnknownServerException: org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
Caused by: org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
at org.apache.kafka.raft.KafkaRaftClient.lambda$append$27(KafkaRaftClient.java:2262)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.kafka.raft.KafkaRaftClient.append(KafkaRaftClient.java:2261)
at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2257)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:813)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:792)
at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:903)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:791)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
And
ERROR [Controller 3] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
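To make the durations in the INFO lines above easier to read, here is a minimal sketch of how they could be pulled out of the log and converted from microseconds to seconds. The names `LINE_RE` and `parse_failed_events` are my own helpers, not part of Kafka, and the regular expression only assumes the line layout visible in the excerpt above; it deliberately skips the WARN line, whose wording differs.

```python
import re

# Matches controller log lines of the form seen in the excerpt, e.g.
# "... [Controller 3] writeNoOpRecord: failed with NotControllerException in 415741179 us (...)"
# The pattern is an assumption based only on the lines pasted in this post.
LINE_RE = re.compile(
    r"\[Controller (?P<node>\d+)\] (?P<event>\w+): failed with "
    r"(?P<error>\w+) in (?P<micros>\d+) us"
)


def parse_failed_events(lines):
    """Yield (event, error, seconds) for each matching failed controller event."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield m["event"], m["error"], int(m["micros"]) / 1_000_000


# Two lines copied from the log above.
sample = [
    "[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with "
    "NotControllerException in 415741179 us (org.apache.kafka.controller.QuorumController)",
    "[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with "
    "NotControllerException in 5912 us (org.apache.kafka.controller.QuorumController)",
]

events = list(parse_failed_events(sample))
for event, error, seconds in events:
    print(f"{event}: {error} after {seconds:.3f} s")
```

Read this way, the largest figure reported (415741179 us) is about 415.7 seconds, i.e. close to 7 minutes, while the smallest ones are only a few milliseconds.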
As this is a production node, we monitor it constantly with Prometheus and Grafana. The timestamps indicate that this broker had trouble at 2023-01-09 09:53. According to the monitoring, the other 7 brokers should have been working properly and no data loss should have occurred, but the monitoring results differ from what we expected.
The issue happened again at 11:31.
Observations:
In this case, I assume that there is no data loss based on the monitoring screenshots and the topic messages.
Is this correct? How can we prevent this issue from recurring?