In a worker node reboot scenario (Kubernetes 1.14.3), does the order in which StatefulSet pods start matter? I have a Confluent Kafka (5.5.1) situation where member 1 starts well before member 0 and slightly ahead of member 2, and as a result I see a lot of crashes on pod 0. Is there some mechanism here that breaks things? Startup is supposed to follow ordinal order and deletion the reverse, but what happens when that order is broken?

  Started:      Sun, 02 Aug 2020 00:52:54 +0100 kafka-0
  Started:      Sun, 02 Aug 2020 00:50:25 +0100 kafka-1
  Started:      Sun, 02 Aug 2020 00:50:26 +0100 kafka-2
  Started:      Sun, 02 Aug 2020 00:28:53 +0100 zk-0
  Started:      Sun, 02 Aug 2020 00:50:29 +0100 zk-1
  Started:      Sun, 02 Aug 2020 00:50:19 +0100 zk-2
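
For context, a minimal sketch of the kind of StatefulSet spec involved; the name, labels, service name, and probe values below are illustrative, not my exact manifest. podManagementPolicy: OrderedReady is the default and is what gives the ordinal-start / reverse-delete behaviour described above:

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: kafka                      # illustrative name
  spec:
    serviceName: kafka-headless      # headless Service giving pods stable DNS names
    replicas: 3
    # OrderedReady (the default) means kafka-0 must be Running and Ready
    # before kafka-1 is created, and deletion happens in reverse ordinal order.
    # Parallel would launch and terminate all pods at once.
    podManagementPolicy: OrderedReady
    selector:
      matchLabels:
        app: kafka
    template:
      metadata:
        labels:
          app: kafka
      spec:
        containers:
        - name: kafka
          image: confluentinc/cp-kafka:5.5.1
          readinessProbe:            # readiness gates the "Ready" part of OrderedReady
            tcpSocket:
              port: 9092
            initialDelaySeconds: 30
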
  • Read about Pod Management Policy here: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-management-policies. I believe your question is answered in the paragraph just above the links section. – Tarun Khosla Aug 07 '20 at 13:11
  • @TarunKhosla, it is indeed helpful in terms of starting/deleting, but as mentioned I had a stable running state; at some point the IT department magically appears and patches the on-premise worker nodes. I log in and see 10 restarts of pod 0 and 1-2 on 2 and 3. I am trying to understand how the StatefulSet manages the endpoints and the headless Service, and whether something might not get updated because of the broken order. Most likely the Kafka leaders or internal connections get broken, but since it is managed by Confluent, maybe something is not handled properly in this scenario on the StatefulSet side. – anVzdGFub3RoZXJodW1hbg Aug 07 '20 at 14:46

0 Answers