I'm working on moving all our Kafka traffic over SSL. We have two clusters in each region.

Using Kafka version 2.7.0.

All regions and all clusters work fine over SSL except one cluster.

Among other tools, I use kafkacat to probe the cluster.

When kafkacat -L is executed against this cluster over a plaintext connection, it lists all brokers, topics, and partitions, and shows the leader of each partition:

# kafkacat -b kafka-cluster1-kafka-brokers.domain.com:9092 -L | head
Metadata for all topics (from broker -1: kafka-cluster1-kafka-brokers.domain.com:9092/bootstrap):
 4 brokers:
  broker 1 at kafka-cluster1-kafka-1.domain.com:9092 (controller)
  broker 4 at kafka-cluster1-kafka-4.domain.com:9092
  broker 2 at kafka-cluster1-kafka-2.domain.com:9092
  broker 3 at kafka-cluster1-kafka-3.domain.com:9092
 49 topics:
  topic "topic.name" with 4 partitions:
    partition 0, leader 3, replicas: 3,2,1, isrs: 1,2,3
    partition 1, leader 1, replicas: 1,3,4, isrs: 1,3,4

When the same command is executed over SSL, kafkacat:

  1. finds 0 brokers;
  2. lists topics and partitions, but no leaders.

# kafkacat -b kafka-cluster1-kafka-brokers.domain.com:9093 -X security.protocol=SSL -X ssl.endpoint.identification.algorithm=none -X enable.ssl.certificate.verification=false -L | head
Metadata for all topics (from broker -1: ssl://kafka-cluster1-kafka-brokers.domain.com:9093/bootstrap):
 0 brokers:
 49 topics:
  topic "topic_name" with 4 partitions:
    partition 0, leader -1, replicas: 3,2,1, isrs: 1,2,3, Broker: Leader not available
    partition 1, leader -1, replicas: 1,3,4, isrs: 1,3,4, Broker: Leader not available
    partition 2, leader -1, replicas: 4,1,2, isrs: 1,2,4, Broker: Leader not available
    partition 3, leader -1, replicas: 2,4,3, isrs: 2,3,4, Broker: Leader not available

The same commands against the other cluster in the region, both over plaintext and over SSL, work perfectly.

Inter-broker communication and communication with the Zookeeper cluster are PLAINTEXT. For now, SSL is only used for talking to Kafka clients. No authentication is used yet, and the clients do not verify the server's certificate.
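For reference, a listener setup matching this description would look roughly like the following in server.properties (the host name is a placeholder; adjust to your brokers):

```
listeners=PLAINTEXT://:9092,SSL://:9093
advertised.listeners=PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092,SSL://kafka-cluster1-kafka-1.domain.com:9093
inter.broker.listener.name=PLAINTEXT
```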

The clusters are built using Packer for the AMIs and Terraform for deployment; it's all automated. I've triple-checked that there is nothing different about the configuration of this cluster compared to the others.

The certificates used are issued by Let's Encrypt. I even tried copying the certificates from the other cluster in the region (which works fine), but I still get the same result.

Other than the host names, the configuration is identical between the cluster which works with SSL and the one which doesn't.

What else could cause such odd behaviour?

EDIT: More investigation shows that the Zookeeper records for the Kafka brokers on this cluster are missing the SSL mappings:

{"features":{},"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092"],"rack":"ap-southeast-2a","jmx_port":9999,"port":9092,"host":"kafka-cluster1-kafka-1.domain.com","version":5,"timestamp":"1628554957052"}

As opposed to a "healthy" cluster:

{"features":{},"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT","SSL":"SSL"},"endpoints":["PLAINTEXT://kafka-cluster2-kafka-1.domain.com:9092","SSL://kafka-cluster2-kafka-1.domain.com:9093"],"rack":"ap-southeast-2a","jmx_port":9999,"port":9092,"host":"kafka-cluster2-kafka-1.domain.com","version":5,"timestamp":"1626842428002"}

The Zookeeper record is deleted when we stop the broker, but reappears with the same wrong content when we start it.
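The broker registration shown above can be re-checked after each restart with the ZooKeeper shell that ships with Kafka; a sketch, assuming ZooKeeper is reachable on localhost:2181 and you are inspecting broker.id 1:

```
/opt/kafka/bin/zookeeper-shell.sh localhost:2181 get /brokers/ids/1
```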

The broker's server.properties file originally did NOT have the listener.security.protocol.map line, because we want the default, but even setting it explicitly like this didn't make a difference:

listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

So the question now is: where does Kafka get the information that it puts in the Zookeeper record?


1 Answer

The problem turned out to be that a Kafka dynamic configuration was set on that cluster, and it overrode the "static" configuration set in the text files: as the synonyms list below shows, DYNAMIC_BROKER_CONFIG takes precedence over STATIC_BROKER_CONFIG.

# /opt/kafka/bin/kafka-configs.sh --bootstrap-server $(hostname):9092 --entity-type brokers --entity-name 1 --describe
Dynamic configs for broker 1 are:
  advertised.listeners=PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092 sensitive=false synonyms={DYNAMIC_BROKER_CONFIG:advertised.listeners=PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092, STATIC_BROKER_CONFIG:advertised.listeners=PLAINTEXT://kafka-cluster1-kafka-1.domain.com:9092,SSL://kafka-cluster1-kafka-1.domain.com:9093}

Other clusters, which didn't have this problem, had no dynamic configs set. Removing it on the bad cluster fixed the issue, and kafkacat started working over SSL just as it did over PLAINTEXT.

The command to remove:

/opt/kafka/bin/kafka-configs.sh --bootstrap-server $(hostname):9092 --entity-type brokers --entity-name x --alter --delete-config advertised.listeners

Where x is each broker.id in the cluster (run the command once per broker).
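The per-broker deletion can be scripted; a sketch, assuming broker IDs 1-4 as in this cluster (substitute your own). The command is only printed here so it can be reviewed first; drop the leading echo to actually execute it:

```shell
# Print the delete command for each broker ID (assumed to be 1-4).
# Remove the leading 'echo' to run kafka-configs.sh for real.
for id in 1 2 3 4; do
  echo /opt/kafka/bin/kafka-configs.sh --bootstrap-server "$(hostname):9092" \
    --entity-type brokers --entity-name "$id" \
    --alter --delete-config advertised.listeners
done
```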
