I'm working on moving all our Kafka traffic over SSL. We have two clusters in each region, running Kafka version 2.7.0. All regions and all clusters work fine over SSL except one cluster.
Among other tools, I use `kafkacat` to probe the cluster. When `kafkacat -L` is executed against this cluster over a plaintext connection, it lists all brokers, topics and partitions, and shows the leader of each partition:
```
# kafkacat -b kafka-cluster1-kafka-brokers.domain.com:9092 -L | head
Metadata for all topics (from broker -1: kafka-cluster1-kafka-brokers.domain.com:9092/bootstrap):
 4 brokers:
  broker 1 at kafka-cluster1-kafka-1.domain.com:9092 (controller)
  broker 4 at kafka-cluster1-kafka-4.domain.com:9092
  broker 2 at kafka-cluster1-kafka-2.domain.com:9092
  broker 3 at kafka-cluster1-kafka-3.domain.com:9092
 49 topics:
  topic "topic.name" with 4 partitions:
    partition 0, leader 3, replicas: 3,2,1, isrs: 1,2,3
    partition 1, leader 1, replicas: 1,3,4, isrs: 1,3,4
```
When executing the same command over SSL, `kafkacat`:

- finds 0 brokers
- lists topics and partitions, but no leaders

```
# kafkacat -b kafka-cluster1-kafka-brokers.domain.com:9093 -X security.protocol=SSL -X ssl.endpoint.identification.algorithm=none -X enable.ssl.certificate.verification=false -L | head
Metadata for all topics (from broker -1: ssl://kafka-cluster1-kafka-brokers.domain.com:9093/bootstrap):
 0 brokers:
 49 topics:
  topic "topic_name" with 4 partitions:
    partition 0, leader -1, replicas: 3,2,1, isrs: 1,2,3, Broker: Leader not available
    partition 1, leader -1, replicas: 1,3,4, isrs: 1,3,4, Broker: Leader not available
    partition 2, leader -1, replicas: 4,1,2, isrs: 1,2,4, Broker: Leader not available
    partition 3, leader -1, replicas: 2,4,3, isrs: 2,3,4, Broker: Leader not available
```
The same commands against the other cluster in the region, both over plaintext and over SSL, work perfectly.
The `security.inter.broker.protocol` and the communication with the Zookeeper cluster are PLAINTEXT. For now, SSL is only used to talk to the Kafka clients. No authentication is used yet, and the clients do not verify the server's certificate.
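For context, that setup corresponds roughly to the following `server.properties` fragment; the keystore path and values here are placeholders, not our actual config:

```properties
# Inter-broker traffic stays on PLAINTEXT; SSL is only for Kafka clients
security.inter.broker.protocol=PLAINTEXT
# Clients do not authenticate or verify the server certificate
ssl.client.auth=none
ssl.keystore.location=/etc/kafka/ssl/keystore.jks
ssl.keystore.password=changeit
```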
The clusters are built using Packer for the AMI and Terraform for deployment; it's all automated. I've triple-checked that there is nothing different about the configuration of this cluster compared to the others. The certificates are issued by Let's Encrypt. I even tried copying the certificates from the other cluster in the region, which works fine, but I still get the same result.
Other than the host names, the configuration is identical between the cluster which works with SSL and the one which doesn't.
What else could cause such odd behaviour?
EDIT: More investigation shows that the Zookeeper records for the Kafka brokers on this cluster are missing the SSL mappings:

```json
{"features":{},"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092"],"rack":"ap-southeast-2a","jmx_port":9999,"port":9092,"host":"kafka-cluster1-kafka-1.domain.com","version":5,"timestamp":"1628554957052"}
```

As opposed to a "healthy" cluster:

```json
{"features":{},"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT","SSL":"SSL"},"endpoints":["PLAINTEXT://kafka-cluster2-kafka-1.domain.com:9092","SSL://kafka-cluster2-kafka-1.domain.com:9093"],"rack":"ap-southeast-2a","jmx_port":9999,"port":9092,"host":"kafka-cluster2-kafka-1.domain.com","version":5,"timestamp":"1626842428002"}
```
The Zookeeper record gets deleted when we stop the broker, but appears with the same wrong content when we start it.
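Side note for anyone comparing the two registrations: the only relevant difference is in the `listener_security_protocol_map` and `endpoints` fields, which can be checked with a quick Python sketch over the JSON shown above (trimmed to the two fields):

```python
import json

# Broker registration as read from ZooKeeper on the broken cluster --
# note the missing SSL entries (fields trimmed to the relevant two).
broken = json.loads(
    '{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},'
    '"endpoints":["PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092"]}'
)

# The same znode on the healthy cluster.
healthy = json.loads(
    '{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT","SSL":"SSL"},'
    '"endpoints":["PLAINTEXT://kafka-cluster2-kafka-1.domain.com:9092",'
    '"SSL://kafka-cluster2-kafka-1.domain.com:9093"]}'
)

def has_ssl(reg):
    """True if the broker advertised an SSL listener in its registration."""
    return "SSL" in reg["listener_security_protocol_map"] and any(
        ep.startswith("SSL://") for ep in reg["endpoints"]
    )

print(has_ssl(broken))   # False
print(has_ssl(healthy))  # True
```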
The broker's `server.properties` file used NOT to have the `listener.security.protocol.map` line, because we want the default, but even uncommenting it like this didn't make a difference:

```properties
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
```
So the question now is: where does Kafka get the information that it puts in the Zookeeper record?
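For what it's worth: as far as I understand, the broker builds that znode at startup from its effective `advertised.listeners` (falling back to `listeners`) together with `listener.security.protocol.map`, so a registration containing only the PLAINTEXT endpoint suggests the broker never parsed an SSL entry from those properties. A fragment that should produce both mappings would look roughly like this (hostname taken from the output above):

```properties
listeners=PLAINTEXT://:9092,SSL://:9093
advertised.listeners=PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092,SSL://kafka-cluster1-kafka-1.domain.com:9093
```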