
We have a kafka cluster. And a network. Yay. The network will be unavailable across all racks in our data center for 5-10 minutes (!) because maintenance requires it. I'm concerned that is too long an outage for kafka to handle gracefully and that it might start getting so confused about its state that it will not recover once the network is back online.

Is it a good idea to just shut the cluster down, and if so, what's the best way to take all the brokers offline?

It's a kafka 0.10.0 cluster running on 6 nodes distributed in different racks around the data center.

2 Answers


Is it a good idea to just shut the cluster down

Maybe. It depends on your durability requirements and whether you can tolerate losing data when recovering from this network isolation. Be absolutely sure you know what happens to your system in a network partition.

The Jepsen project put Kafka through its paces a few years ago, and unclean leader election was the problem it found. A single in-sync replica (ISR) could remain leader; if that last ISR was partitioned away or died, the remaining nodes would elect a new leader, losing the writes the old leader had accepted. Unclean leader election stayed enabled by default until version 0.11.

Shutting down entirely prior to the network event means there cannot be an unclean leader election caused by the partition. Have a look at the controlled.shutdown.enable and auto.leader.rebalance.enable options to make partition leadership migration easier.
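
If you do shut everything down first, a rough sequence is sketched below. It assumes the stock scripts and layout of the Kafka distribution (bin/kafka-server-stop.sh, config/server.properties); adjust paths for your packaging. Both settings default to true in 0.10.x, so the grep is only a sanity check.

# On each broker, confirm controlled shutdown and auto rebalance are not disabled
# (both default to true in 0.10.x)
grep -E 'controlled\.shutdown\.enable|auto\.leader\.rebalance\.enable' config/server.properties

# Stop brokers one at a time; kafka-server-stop.sh sends SIGTERM, which triggers a
# controlled shutdown and migrates partition leadership away before the broker exits
bin/kafka-server-stop.sh

# After the maintenance window, start the brokers back up one at a time
bin/kafka-server-start.sh -daemon config/server.properties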

To choose durability, tune the cluster so that a majority of replicas must acknowledge each write, by producing with acks set to "all".

When a producer sets acks to "all" (or "-1"), min.insync.replicas specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful. If this minimum cannot be met, then the producer will raise an exception (either NotEnoughReplicas or NotEnoughReplicasAfterAppend). When used together, min.insync.replicas and acks allow you to enforce greater durability guarantees. A typical scenario would be to create a topic with a replication factor of 3, set min.insync.replicas to 2, and produce with acks of "all". This will ensure that the producer raises an exception if a majority of replicas do not receive a write.

default.replication.factor=3
min.insync.replicas=2
# Default from 0.11.0
unclean.leader.election.enable=false
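
As a concrete example of the topic-level half of that, here is a sketch using the stock CLI from 0.10; the topic name, partition count, and ZooKeeper address are placeholders. The acks setting itself lives in the producer's own configuration.

# Create a topic whose writes need 2 of 3 replicas in sync to be accepted
# (producers must also send with acks=all for this to take effect)
bin/kafka-topics.sh --create --zookeeper zk1:2181 --topic example-topic \
  --partitions 6 --replication-factor 3 --config min.insync.replicas=2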

With your current network, if you choose consistency you sacrifice availability. There can be no majority of replicas if none of them can talk to each other. Up to you whether this downtime is expensive enough to justify spreading the cluster across multiple network failure domains.

John Mahowald
  • Thanks for the detailed response! When you say 'data loss' I assume that you mean new logs coming in from producers or position information for clients may be incorrect; logs already in the system will not become corrupt, correct? If so, new log data loss is acceptable for the time of the outage plus shutdown/bring up. – Eric Horne Jul 02 '18 at 15:12
  • All writes committed on the old leader during the partition could be lost; its history does not agree with the new leader. See the diagrams in the post I linked. If you don't care about data loss during the event, you might not care as much about unclean elections or majority of replicas. – John Mahowald Jul 04 '18 at 15:14
  • It will be a complete network outage, so there will be no logs coming in at all during the outage. My bigger concern was all the brokers suddenly not seeing any of the other brokers or ZooKeepers and then trying to take over their respective replicas -- that might create a strange situation when all the brokers are suddenly visible again, even with very low (or no) producer traffic. Based on some testing I've done, the brokers look like they won't get as confused as I thought they might. I can post back results. – Eric Horne Jul 04 '18 at 20:19

The outage ended up not being as severe as we thought it would be.

The cluster was left up for the network outage. All Kafka clients were shut down, so the cluster was quiet before the outage began. The outage lasted about 3 minutes. Once the network came back, the cluster was allowed to reconverge, and it appears to have done just that. A preferred leader election was then requested, and all brokers and topics returned to a good state. Once stable, the Kafka clients were brought back online and everything worked.
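
For reference, the reconvergence check and the preferred leader election can be done with the stock tools, roughly as below; the ZooKeeper address is a placeholder, and kafka-preferred-replica-election.sh is the 0.10-era tool.

# After the network returns, wait for this to come back empty
bin/kafka-topics.sh --describe --zookeeper zk1:2181 --under-replicated-partitions

# Then move leadership back to the preferred replicas
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181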

So, for this situation, the right thing to do is to quiet the Kafka cluster (stop the clients) but not bring down any of the brokers; let it ride -- it will recover. Of course, this assumes you can accommodate any data loss during the outage.