I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3), with each Zookeeper instance running on the same machine as a Kafka broker (Java 11.0.18). Everything worked fine for a long time.
However, the first time a machine failed (taking down both a Zookeeper instance and a Kafka broker), the other 2 were unable to continue working. In this case, the machine that failed was the leader. The 2 remaining Zookeeper instances seemed unable to communicate with each other and could not elect a new leader. That doesn't make sense to me, because they were communicating with each other before the failure. Only after the failed machine was booted up again were the other 2 machines able to elect a new leader.
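To spell out why I expected the 2 survivors to keep working: as far as I understand, a Zookeeper ensemble stays available as long as a strict majority of its servers is up. Here is a minimal sketch of that arithmetic (the function names are just mine, for illustration, and are not part of any Zookeeper API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, alive: int) -> bool:
    """True if the servers still alive can form a majority quorum."""
    return alive >= quorum_size(ensemble_size)


print(quorum_size(3))    # 2
print(has_quorum(3, 2))  # True
print(has_quorum(3, 1))  # False
```

So with 3 servers and 1 down, the remaining 2 should still be enough to elect a leader, which is why the behaviour I saw surprises me.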
The logs don't give me much more information than what I explained above. The 2 surviving machines act as if they don't "see" each other and are unable to elect a leader. As soon as the failed machine comes back up, they manage to elect a new one.
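In case it helps with diagnosing this, below is a rough sketch of how each node's role could be checked from outside, using Zookeeper's `srvr` four-letter-word command over the client port. It assumes the default client port 2181 and that `srvr` is allowed by `4lw.commands.whitelist` on the servers. The helper names are hypothetical and only for illustration.

```python
import socket
from typing import Optional


def four_letter_word(host: str, port: int = 2181, cmd: str = "srvr",
                     timeout: float = 3.0) -> str:
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output: str) -> Optional[str]:
    """Extract the value of the 'Mode:' line (e.g. leader or follower).

    Returns None when the reply contains no 'Mode:' line.
    """
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling `parse_mode(four_letter_word(host))` for each of the 3 machines (with `host` replaced by the real hostnames) while one of them is down should show whether either survivor ever reports a `leader` or `follower` mode.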
Can anyone help me shed some light on this problem?
Is there some configuration property I'm missing?
Searching online, I found these 2 posts describing problems similar to mine, but neither gives a clear answer as to why this happens or how to fix it:
zookeeper issue - taking 15 minutes to recover if leader is killed
https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/