3-node Zookeeper ensemble unable to recover if leader fails

Question

I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3), with each Zookeeper instance running on the same machine as a Kafka broker (Java 11.0.18). Everything worked fine for a long time.

However, the first time a machine failed (taking down both a Zookeeper instance and a Kafka broker), the other 2 were unable to continue working. In this case, the failed machine hosted the Zookeeper leader. The 2 remaining Zookeeper instances behaved as if they couldn't communicate with each other and were unable to elect a new leader. That doesn't make sense to me, because they were communicating with each other before the failure. Only once the failed machine was booted up again were the other 2 machines able to elect a new leader.
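For context, the expectation that 2 out of 3 nodes should still be able to elect a leader follows from Zookeeper's majority-quorum rule. A minimal sketch of that arithmetic, assuming the default majority quorum (the function names are illustrative and not part of any Zookeeper API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of voting servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def can_elect_leader(ensemble_size: int, alive: int) -> bool:
    """True if the surviving servers are enough to form a quorum."""
    return alive >= quorum_size(ensemble_size)


# A 3-node ensemble needs 2 votes, so it should tolerate one failed node.
print(quorum_size(3))          # 2
print(can_elect_leader(3, 2))  # True
print(can_elect_leader(3, 1))  # False
```

By this rule the 2 surviving nodes are enough on paper, which suggests the problem lies in how they reach each other rather than in the ensemble size.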

The logs don't give me much more information than what I explained above. The 2 surviving machines act as if they don't "see" each other and are unable to elect a leader. When the failed machine comes back up, they manage to elect a new leader.
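One way to test whether the 2 surviving machines can actually reach each other is a plain TCP probe, run from each survivor against the other while the third machine is down. This is only a rough sketch: the hostname in the usage comment is hypothetical, and the ports (2181 for clients, 3888 for leader election) are the values from Zookeeper's sample configuration, so they are assumptions that should be replaced with the clientPort and server.N values from the actual zoo.cfg.

```python
import socket

# Assumed ports, taken from Zookeeper's sample configuration; check clientPort
# and the server.N lines in your own zoo.cfg, since they may differ.
DEFAULT_PORTS = {"client": 2181, "election": 3888}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_peers(hosts, ports=DEFAULT_PORTS):
    """Map each host to a {port_name: reachable} dictionary."""
    return {
        host: {name: port_open(host, port) for name, port in ports.items()}
        for host in hosts
    }


# Example with a hypothetical hostname, run from one surviving node:
# print(probe_peers(["zk2.example.internal"]))
```

The quorum port (2888 in the sample configuration) is left out on purpose, because only the current leader listens on it, so it is expected to be closed while no leader exists. An unreachable election port between the survivors would point to a network, firewall, or name-resolution issue rather than a missing Zookeeper property.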

Can anyone help me shed some light on this problem?

Is there some configuration property I'm missing?

From searching the internet, I found these 2 articles describing problems similar to mine, but they don't give a clear answer as to why this happens or how to fix it:

zookeeper issue - taking 15 minutes to recover if leader is killed

https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/

0 Answers