I am running a ZooKeeper Kubernetes StatefulSet with 3 replicas for my Kafka clusters, with pods named zookeeper-0, zookeeper-1, and zookeeper-2, and a liveness probe enabled that uses the ruok command. If any StatefulSet pod gets restarted because of a failure, the quorum breaks and ZooKeeper stops working, even after the failed pod has come back up and its liveness probe responds ok.
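For reference, the liveness probe is roughly the following; the port and timing values here are illustrative rather than my exact settings:

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # send the ZooKeeper "ruok" four-letter word and expect "imok" back
      - 'echo ruok | nc -w 2 localhost 2181 | grep imok'
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
```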
When this happens, I have to manually restart all ZooKeeper instances to get the ensemble working again. The same thing happens when I do a helm upgrade: the rolling update restarts zookeeper-2 first, then zookeeper-1, and finally zookeeper-0, but ZooKeeper only seems to recover if I start all instances together. So after every helm upgrade I have to restart all instances manually.
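In case it matters, as far as I can tell the chart leaves the StatefulSet on the default policies, which would explain the restart order I am seeing (pods are rolled from the highest ordinal down). A simplified sketch of what I believe is rendered, assuming my chart does not override these fields:

```yaml
spec:
  podManagementPolicy: OrderedReady   # default: pods are created/started one at a time, in order
  updateStrategy:
    type: RollingUpdate               # default: on upgrade, pods are replaced starting from
                                      # the highest ordinal (zookeeper-2) down to zookeeper-0
```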
My question is:
What could be the reason for this behaviour? And what is the best way to run a 100% reliable ZooKeeper StatefulSet in a Kubernetes environment?