Consider a StatefulSet (Cassandra, using the official K8S example) across 3 availability zones:
- cassandra-0 -> zone a
- cassandra-1 -> zone b
- cassandra-2 -> zone c
Each Cassandra pod uses an EBS volume, so there is automatically a zone affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
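To make the setup concrete, here is a minimal sketch of the kind of manifests I mean, loosely following the official Cassandra example (the StorageClass name, image tag, and sizes are placeholders): each replica gets its own PVC from volumeClaimTemplates, the PVC binds to a zonal EBS volume, and from then on that pod can only be scheduled in that volume's zone.

```yaml
# Illustrative StorageClass + StatefulSet; names and sizes are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# EBS volumes are zonal, so each PV created from this class lives in exactly
# one availability zone.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # One PVC per pod (cassandra-data-cassandra-0, -1, -2). Once a PVC is bound
  # to an EBS volume in e.g. zone-a, that pod is effectively pinned to zone-a
  # because the volume cannot be attached from another zone.
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: ebs-gp2
      resources:
        requests:
          storage: 10Gi
```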
If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new nodes and have their EBS volumes re-attached, as if nothing had happened.
Now, if the entire AZ "zone-a" goes down and is unavailable for some time, cassandra-0 cannot start anymore because of the affinity to its EBS volume in that zone. You are left with:
- cassandra-1 -> zone b
- cassandra-2 -> zone c
Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.
Now if, on top of that, another K8S node goes down or you have set up auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needing to move to another K8S node. That shouldn't be a problem.
However, from my testing, K8S will not do that because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X) because it wants cassandra-0 back first. And cassandra-0 cannot start because its volume is in a zone which is down and not recovering.
So if you use a StatefulSet + VolumeClaims across zones, AND you experience an entire AZ failure, AND you experience an EC2 failure in another AZ or have auto-scaling of your infrastructure,
=> then you will lose all your Cassandra pods until zone-a is back online.
This seems like a dangerous situation. Is there a way for a StatefulSet to not care about the order and still self-heal, or to start more pods (cassandra-3, 4, 5, X)?
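For what it's worth, the only related knob I have found so far is podManagementPolicy. Below is a sketch of what I imagine trying, but I have not verified whether Parallel actually changes the recovery behaviour described above or only affects initial scale-up/scale-down ordering.

```yaml
# Hypothetical tweak: override the default OrderedReady policy so the
# controller does not wait for cassandra-0 before acting on other replicas.
# Unverified whether this fixes the self-healing case described above.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  podManagementPolicy: Parallel   # default is OrderedReady
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
```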