Is it a good idea to just shut the cluster down?
Maybe. It depends on your durability requirements: whether you can tolerate losing data when recovering from this network isolation. Be absolutely sure you know what happens to your system in a network partition.
The Jepsen project put Kafka through its paces a few years ago. Unclean leader election was a problem: a single in-sync replica (ISR) could remain leader, and if that last ISR was network partitioned or died, the remaining nodes would elect a new leader that was missing writes, causing data loss. I believe that remained the default behavior until version 0.11.
Shutting down entirely prior to the network event means that there cannot be an unclean leader election due to the network partition. Have a look at the controlled.shutdown.enable and auto.leader.rebalance.enable options to make partition migration easier; a sketch of those settings follows.
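For example, in the broker configuration (a sketch; I believe controlled shutdown is already enabled by default on recent versions, so these may simply confirm what you have):

controlled.shutdown.enable=true
auto.leader.rebalance.enable=true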
If you choose durability, consider tuning so that a majority of replicas are required to acknowledge writes, by setting acks to "all". From the Kafka documentation:
When a producer sets acks to "all" (or "-1"), min.insync.replicas
specifies the minimum number of replicas that must acknowledge a write
for the write to be considered successful. If this minimum cannot be
met, then the producer will raise an exception (either
NotEnoughReplicas or NotEnoughReplicasAfterAppend). When used
together, min.insync.replicas and acks allow you to enforce greater
durability guarantees. A typical scenario would be to create a topic
with a replication factor of 3, set min.insync.replicas to 2, and
produce with acks of "all". This will ensure that the producer raises
an exception if a majority of replicas do not receive a write.
default.replication.factor=3
min.insync.replicas=2
# Default since 0.11.0
unclean.leader.election.enable=false
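On the producer side, here is a minimal sketch of what that guarantee looks like in practice (the broker addresses, topic name, and key/value are placeholders, not taken from your setup): with acks set to "all", a write to a partition that has fewer than min.insync.replicas live replicas fails instead of being silently under-replicated.

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;

public class DurableProducer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // require acknowledgement from all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; calling get() blocks and surfaces broker-side errors
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        } catch (ExecutionException e) {
            if (e.getCause() instanceof NotEnoughReplicasException) {
                // Fewer than min.insync.replicas replicas were available,
                // so the broker rejected the write rather than risk losing it.
                System.err.println("Write rejected: not enough in-sync replicas");
            } else {
                throw new RuntimeException(e.getCause());
            }
        }
    }
}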
With your current network, choosing consistency means sacrificing availability: there can be no majority of replicas if none of them can talk to each other. It is up to you whether that downtime is expensive enough to justify spreading the cluster across multiple network failure domains.