
I have an Apache Kafka cluster with 3 brokers, and I would like to detect when the cluster is no longer available in order to switch the client connection to a second, replicated cluster (as described here: How to consume from two different clusters in Kafka?).

All the topics on the cluster have a replication factor of 3, so all data should remain available within the cluster if a single node fails.

In this case, the cluster can be considered unusable if 2 brokers are offline. I am using the Confluent.Kafka NuGet package (https://www.nuget.org/packages/Confluent.Kafka/) to create a .NET client. However, with both the Producer and Consumer clients, it is only possible to detect when all the brokers are down (by checking for the Local_AllBrokersDown error code).
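To illustrate why that signal is not enough, here is the decision logic it gives me, reduced to plain Python (the strings stand in for the library's ErrorCode values; in the real client this check would sit in the error handler):

```python
# Illustrative sketch only: the client surfaces Local_AllBrokersDown
# once connections to *all* brokers are lost, so this check cannot
# distinguish "2 of 3 brokers down" from a fully healthy cluster.
ALL_BROKERS_DOWN = "Local_AllBrokersDown"

def should_failover(error_code: str) -> bool:
    """Return True only when the client has lost every broker."""
    return error_code == ALL_BROKERS_DOWN
```

With this rule, the failover to the second cluster would trigger only after the third broker also dies, which is exactly the limitation described above.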

One solution would be to have a producer that continuously produces messages to a topic in order to 'heartbeat' the cluster. Since the replication factor is 3, I set min.insync.replicas for that specific topic to 2. According to the documentation, if the producer uses acks=all, I should receive a NotEnoughReplicas error code when trying to publish a message while fewer than 2 replicas are in sync.
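The health rule this heartbeat is meant to enforce boils down to the following sketch; the numbers mirror the setup described above (replication factor 3, min.insync.replicas=2):

```python
# Sketch of the intended heartbeat rule, not library code: with
# acks=all, a produce succeeds only while the heartbeat topic's
# in-sync replica count stays at or above min.insync.replicas, so a
# NotEnoughReplicas error should mark the cluster as unusable.
def cluster_usable(in_sync_replicas: int, min_insync: int = 2) -> bool:
    return in_sync_replicas >= min_insync
```

So with 3 brokers up (or even 2), the heartbeat should succeed, and with only 1 broker left it should fail with NotEnoughReplicas.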

In practice, when 2 brokers go offline, my client application is connected only to the single remaining broker, which cannot form a cluster by itself. If I use KafkaManager on this remaining broker, it still states that it is connected to another broker and that the topic has 2 in-sync replicas. The .NET client therefore never receives the NotEnoughReplicas error code (only Local_TimedOut from the remaining online broker). This might be by design, to avoid split-brain scenarios...

Does anybody have an idea how I could monitor the availability of such a cluster, in this specific case detecting when 2 of the 3 brokers are down?
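For reference, the decision I am trying to implement would look roughly like this; the `reachable` callable is a placeholder for some probe of each advertised broker endpoint (e.g. a TCP connect with a short timeout), and all names here are hypothetical:

```python
def live_broker_count(brokers, reachable):
    """Count the brokers that answer a probe.

    brokers: {broker_id: "host:port"} map, e.g. taken from cluster
    metadata; reachable: callable that probes one endpoint and returns
    True/False (e.g. a timed TCP connect).
    """
    return sum(1 for addr in brokers.values() if reachable(addr))

def cluster_available(brokers, reachable, quorum: int = 2) -> bool:
    # Treat the cluster as usable only while at least `quorum` brokers
    # respond (2 of the 3 brokers in the setup described above).
    return live_broker_count(brokers, reachable) >= quorum
```

The open question is what the probe should be, given that the remaining broker still accepts connections but times out on produce requests.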

Thank you!

    One broker is still a valid "cluster". Just that the topic would be considered under-replicated. – OneCricketeer Oct 25 '19 at 22:06
  • From what I can see, a single broker does not act as a cluster. The client gets a timeout when trying to connect to it. In the Kafka Manager interface, I can also see that this remaining node detects it lost connection to one of the other brokers (when there were still 2 left in the cluster, i.e. the first node going down), but the disconnection from the other is not registered (when it is the only one left, i.e. the second node going down). I guess this is by design in order to avoid split-brain, but please correct me if I am wrong. – Stefan Nov 04 '19 at 07:22
  • Sounds like you might have a networking misconfiguration. One broker can act on its own in a cluster, it just will fail only when the replication factor of various topics on it cannot be verified. – OneCricketeer Nov 04 '19 at 14:52
  • Okay, @cricket_007 thank you very much! I will investigate why our cluster behaves in this way. – Stefan Nov 05 '19 at 14:35
