I have a Kafka cluster of 3 nodes and I am using kafkacat to consume data from it. I have configured PLAINTEXT and VPN_PLAINTEXT listeners:
listeners=PLAINTEXT://0.0.0.0:6667,VPN_PLAINTEXT://0.0.0.0:6669
advertised.listeners=PLAINTEXT://hadoop-kafka1-stg.local.company.cloud:6667,VPN_PLAINTEXT://hadoop-kafka1-stg-vip.local.company.cloud:6669
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,VPN_PLAINTEXT:PLAINTEXT
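For reference, a quick way to compare what each listener advertises is a kafkacat metadata query against both ports (a sketch using the hostnames from the config above):

# metadata as seen through the VPN_PLAINTEXT listener
kafkacat -L -b hadoop-kafka1-stg-vip.local.company.cloud:6669

# metadata as seen through the internal PLAINTEXT listener
kafkacat -L -b hadoop-kafka1-stg.local.company.cloud:6667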
We found out that we cannot consume data from node 1 (only), i.e. from partitions whose leader is node 1. The consumer fails with this error:
kafkacat -C -b hadoop-kafka1-stg-vip.local.company.cloud:6669 -t <topic-name> -o beginning -e -q -p 11
% ERROR: Topic <topic-name> [11] error: Broker: Not leader for partition
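If more detail is needed, the same consume can be re-run with librdkafka debugging enabled, to show which broker connection the fetch request is actually sent to (a sketch; broker, topic and metadata are standard librdkafka debug contexts):

kafkacat -C -b hadoop-kafka1-stg-vip.local.company.cloud:6669 -t <topic-name> -p 11 -o beginning -e -d broker,topic,metadata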
I can see that node 1 is the leader for this partition:
Metadata for <topic-name> (from broker 3: hadoop-kafka3-stg-vip.local.company.cloud:6669/3):
 3 brokers:
  broker 2 at hadoop-kafka2-stg-vip.local.company.cloud:6669
  broker 3 at hadoop-kafka3-stg-vip.local.company.cloud:6669 (controller)
  broker 1 at hadoop-kafka1-stg-vip.local.company.cloud:6669
 1 topics:
  topic "<topic-name>" with 12 partitions:
    partition 0, leader 2, replicas: 2,1,3, isrs: 3,2,1
    partition 1, leader 3, replicas: 3,2,1, isrs: 3,2,1
    partition 2, leader 1, replicas: 1,3,2, isrs: 3,2,1
    partition 3, leader 2, replicas: 2,3,1, isrs: 3,2,1
    partition 4, leader 3, replicas: 3,1,2, isrs: 3,2,1
    partition 5, leader 1, replicas: 1,2,3, isrs: 3,2,1
    partition 6, leader 2, replicas: 2,1,3, isrs: 3,2,1
    partition 7, leader 3, replicas: 3,2,1, isrs: 3,2,1
    partition 8, leader 1, replicas: 1,3,2, isrs: 3,2,1
    partition 9, leader 2, replicas: 2,3,1, isrs: 3,2,1
    partition 10, leader 3, replicas: 3,1,2, isrs: 3,2,1
    partition 11, leader 1, replicas: 1,2,3, isrs: 3,2,1
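The same leadership can also be cross-checked with the standard Kafka CLI (a sketch; it assumes a Kafka version where kafka-topics.sh supports --bootstrap-server, older versions take --zookeeper instead):

kafka-topics.sh --describe --topic <topic-name> --bootstrap-server hadoop-kafka1-stg.local.company.cloud:6667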
I thought the data on the node could be corrupted, so I removed everything from the Kafka data directory (kafka_data_dir). When I started the daemon again, I could see it re-syncing the replicas. After that, the issue still persists. There is nothing suspicious in the logs.
Could anybody help me find the root cause? Only node 1 encounters this issue. When I query the same node on port 6667, it works smoothly.
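For comparison, here is the pair of consume commands against node 1 (a sketch, assuming the non-VIP hostname is how the PLAINTEXT listener on 6667 is reached; the first command fails with the error above, the second works fine):

# fails: Broker: Not leader for partition
kafkacat -C -b hadoop-kafka1-stg-vip.local.company.cloud:6669 -t <topic-name> -o beginning -e -q -p 11

# works
kafkacat -C -b hadoop-kafka1-stg.local.company.cloud:6667 -t <topic-name> -o beginning -e -q -p 11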