Consider a StatefulSet (Cassandra, using the official K8S example) across 3 availability zones:
- cassandra-0 -> zone a
- cassandra-1 -> zone b
- cassandra-2 -> zone c
Each Cassandra pod uses an EBS volume, so there is automatically a zone affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
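To make the setup concrete, here is a minimal sketch of the kind of manifests I mean, loosely following the official Cassandra example (the StorageClass name, image tag, and sizes are placeholders): each replica gets its own PVC from volumeClaimTemplates, the PVC binds to a zonal EBS volume, and from then on that pod can only be scheduled in that volume's zone.

```yaml
# Illustrative StorageClass + StatefulSet; names and sizes are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# EBS volumes are zonal, so each PV created from this class lives in exactly
# one availability zone.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # One PVC per pod (cassandra-data-cassandra-0, -1, -2). Once a PVC is bound
  # to an EBS volume in e.g. zone-a, that pod is effectively pinned to zone-a
  # because the volume cannot be attached from another zone.
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: ebs-gp2
      resources:
        requests:
          storage: 10Gi
```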
If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new nodes and have their EBS volumes re-attached, as if nothing had happened.
Now, if the entire AZ "zone-a" goes down and is unavailable for some time, cassandra-0 cannot start anymore because of the affinity to its EBS volume in that zone. You are left with:
- cassandra-1 -> zone b
- cassandra-2 -> zone c
Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.
Now if, on top of that, another K8S node goes down or you have set up auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needing to move to another K8S node. That shouldn't be a problem.
However, from my testing, K8S will not do that because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X) because it wants cassandra-0 back first. And cassandra-0 cannot start because its volume is in a zone which is down and not recovering.
So if you use a StatefulSet + VolumeClaims across zones, AND you experience an entire AZ failure, AND you experience an EC2 failure in another AZ or have auto-scaling of your infrastructure,
=> then you will lose all your Cassandra pods until zone-a is back online.
This seems like a dangerous situation. Is there a way for a StatefulSet to not care about the order and still self-heal, or to start more pods (cassandra-3, 4, 5, X)?
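For what it's worth, the only related knob I have found so far is podManagementPolicy. Below is a sketch of what I imagine trying, but I have not verified whether Parallel actually changes the recovery behaviour described above or only affects initial scale-up/scale-down ordering.

```yaml
# Hypothetical tweak: override the default OrderedReady policy so the
# controller does not wait for cassandra-0 before acting on other replicas.
# Unverified whether this fixes the self-healing case described above.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  podManagementPolicy: Parallel   # default is OrderedReady
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
```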