What happens when a node elected as leader goes down?

Question

My question is about the Curator LeaderLatch recipe.

I want to use LeaderLatch to implement a mutex for a scheduled job. There's an additional requirement: if the scheduled job starts at 1:00:00.005 PM and ends at 1:00:00.015 PM, then no other job/instance should start the same task until 1:00:30.000 PM (for this I was thinking about implementing an asynchronous release in the job).
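To make the timing requirement concrete, here is a rough sketch of the delayed release I have in mind. It uses only the JDK; the class name `HoldWindow`, the method names, and the `release` callback are hypothetical placeholders (in a real job the callback would give up leadership), so treat it as an illustration of the arithmetic rather than a finished design.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class HoldWindow {

    /** Time left between now and holdUntil; zero if holdUntil has already passed. */
    public static Duration remainingHold(Instant now, Instant holdUntil) {
        Duration left = Duration.between(now, holdUntil);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /**
     * Schedules the (hypothetical) release callback to run once the hold
     * window is over, so the job thread itself does not have to block.
     */
    public static ScheduledFuture<?> scheduleRelease(
            ScheduledExecutorService scheduler, Runnable release, Instant holdUntil) {
        long delayMs = remainingHold(Instant.now(), holdUntil).toMillis();
        return scheduler.schedule(release, delayMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // The job from the example: it ends at 1:00:00.015 PM and the mutex
        // must stay held until 1:00:30.000 PM (the date is arbitrary).
        Instant jobEnd = Instant.parse("2021-11-10T13:00:00.015Z");
        Instant holdUntil = Instant.parse("2021-11-10T13:00:30.000Z");
        System.out.println(remainingHold(jobEnd, holdUntil).toMillis()); // prints 29985
    }
}
```

So in the example the release would be scheduled roughly 29.985 seconds after the job finishes.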

From the docs: https://curator.apache.org/curator-recipes/leader-latch.html

Error Handling

LeaderLatch instances add a ConnectionStateListener to watch for connection problems. If SUSPENDED or LOST is reported, the LeaderLatch that is the leader will report that it is no longer the leader (i.e. there will not be a leader until the connection is re-established). If a LOST connection is RECONNECTED, the LeaderLatch will delete its previous ZNode and create a new one.

Users of LeaderLatch must take account that connection issues can cause leadership to be lost. i.e. hasLeadership() returns true but some time later the connection is SUSPENDED or LOST. At that point hasLeadership() will return false. It is highly recommended that LeaderLatch users register a ConnectionStateListener.

If I understand correctly, if the leader I1 (instance 1) goes down, the other instances will wait until I1 comes back online and re-establishes the connection. But what happens if I1 never comes back? Will the other instances be able to become leaders? How and when? Or will the other instances be locked out forever? How can they be unlocked?

My expectation is that, somehow, behind the scenes, there is a timeout for the leader's connection. Maybe it is related to how the Curator client is configured. Maybe a re-election happens when the connection is lost. But none of this is described in the error handling section quoted above, or in https://curator.apache.org/errors.html

score 1 · Answer 1 · answered Nov 11 '21 at 01:05

The wording is a little confusing, I'll admit, but I've worked with this extensively. If the current leader loses its connection, it won't block anything for longer than the session timeout. The nodes created by the LeaderLatch for the election are ephemeral, so once the leader's session expires, the server automatically deletes the leadership node associated with it. That triggers a new election among the remaining LeaderLatch participants, and a different instance becomes the new leader and resumes the leader's activities. (On the client side, you can configure the latch to give up leadership only on LOST, not on SUSPENDED.) You'll have to balance your connection and session timeouts against your need for rapid failover.
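As a minimal sketch of where those knobs live, the client below sets the session and connection timeouts and switches to the LOST-only error policy. The connect string, latch path, participant id, and timeout values are made-up placeholders, and the snippet assumes Curator is on the classpath and a ZooKeeper ensemble is reachable. Keep in mind that the ZooKeeper server negotiates the session timeout within its own configured minimum and maximum, so the value you request is not necessarily the one you get.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LatchFailoverSketch {
    public static void main(String[] args) throws Exception {
        try (CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("localhost:2181")   // placeholder address
                .sessionTimeoutMs(15_000)          // upper bound on how long a dead leader blocks others
                .connectionTimeoutMs(5_000)
                .retryPolicy(new ExponentialBackoffRetry(1_000, 3))
                // Treat only LOST (not SUSPENDED) as loss of leadership.
                .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
                .build()) {
            client.start();

            // Placeholder path and id; every instance must use the same path.
            try (LeaderLatch latch = new LeaderLatch(client, "/jobs/my-job/leader", "instance-1")) {
                latch.start();
                latch.await(); // blocks until this instance is elected
                System.out.println("Leader: " + latch.hasLeadership());
                // ... run the scheduled job here ...
            } // close() gives up leadership and removes this participant's ephemeral node
        }
    }
}
```

A shorter session timeout gives faster failover but makes the leader more likely to lose leadership during a brief network hiccup or a long GC pause.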

I think the documentation is describing what happens from the disconnected leader's perspective. After the connection is lost, the LeaderLatch alerts any local listeners that it is no longer the leader, since leadership can't be verified locally until the connection is re-established. Once the connection is re-established, it rejoins the leadership pool, but it won't resume leadership by default.
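On the listener side, something along these lines is one way to track leadership locally so the job can check it before each run. `LeadershipGate` and `mayRunJob` are names I made up for this sketch; `LeaderLatchListener` with its `isLeader()` / `notLeader()` callbacks is the Curator interface. It assumes you pass in an already started `CuratorFramework`.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

public class LeadershipGate {
    private final AtomicBoolean leader = new AtomicBoolean(false);
    private final LeaderLatch latch;

    public LeadershipGate(CuratorFramework client, String latchPath, String id) {
        latch = new LeaderLatch(client, latchPath, id);
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                leader.set(true);   // this instance was just elected
            }

            @Override
            public void notLeader() {
                leader.set(false);  // leadership lost, e.g. after a connection problem
            }
        });
    }

    public void start() throws Exception {
        latch.start();
    }

    /** Check this right before starting the scheduled job. */
    public boolean mayRunJob() {
        return leader.get() && latch.hasLeadership();
    }

    public void close() throws IOException {
        latch.close();
    }
}
```

Because leadership can be lost at any moment, this check narrows the window for a double run but does not eliminate it; a long-running job should re-check or react to `notLeader()`.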
