Context:
I'm trying to understand the motivation behind the existence of the WaitBeforeForcingMasterFailover property (and the code associated with it) in ServiceStack.Redis.RedisSentinel.
If I read the code correctly, this property is meant to cover cases like:
- We have a connection to a healthy sentinel that tells us that the master is at host X.
- When we try to establish a connection to the master at host X, we fail for some reason.
So the logic is: if we continuously fail to create a connection to X for the WaitBeforeForcingMasterFailover period, initiate a forced failover.
This failover does not need to reach a quorum and can elect a new master with just one sentinel available. From the Sentinel documentation:
SENTINEL FAILOVER Force a failover as if the master was not reachable, and without asking for agreement to other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations).
Source: https://redis.io/topics/sentinel#sentinel-api
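To make my reading of the timing logic concrete, here is a minimal sketch in Python. It is a model of the behaviour as I understand it, not the library's actual code: the class FailoverGuard and its method names are hypothetical, and the exact comparison the library uses may differ.

```python
from datetime import datetime, timedelta
from typing import Optional


class FailoverGuard:
    """Hypothetical model of the described behaviour, not ServiceStack code."""

    def __init__(self, wait_before_forcing_failover: timedelta) -> None:
        self.wait = wait_before_forcing_failover
        # Time of the first failure in the current run of consecutive failures.
        self.first_failure_at: Optional[datetime] = None

    def on_connect_success(self) -> None:
        # A successful connection to the master ends the run of failures.
        self.first_failure_at = None

    def on_connect_failure(self, now: datetime) -> bool:
        """Record a failed connection; return True when a forced failover is due."""
        if self.first_failure_at is None:
            self.first_failure_at = now
        if now - self.first_failure_at >= self.wait:
            # The client would issue SENTINEL FAILOVER here; start a fresh window.
            self.first_failure_at = None
            return True
        return False
```

In this model a very large wait value means on_connect_failure never returns True in practice, which is how I understand the "disable it with a huge value" workaround.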
It seems to me that this feature can be beneficial in some cases and troublesome in others.
For example, in case of a network partition, if a client is left connected to a minority of sentinels (which cannot reach a quorum) and these sentinels point to a master that is no longer reachable, the forced failover will trigger a failover within the reachable partition, potentially creating a split-brain situation.
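The scenario can be expressed as a toy model. This is a deliberately simplified sketch, assuming the old master is still alive on the other side of the partition and that a regular failover needs a majority of all sentinels; the names Partition, failover_allowed and masters_after_partition are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    sentinels: int          # sentinels reachable inside this partition
    has_old_master: bool    # whether the old master is inside this partition


def failover_allowed(partition: Partition, total_sentinels: int, forced: bool) -> bool:
    # A regular failover needs a majority of all sentinels;
    # a forced one (SENTINEL FAILOVER) skips the agreement step.
    majority = total_sentinels // 2 + 1
    return forced or partition.sentinels >= majority


def masters_after_partition(client_side: Partition, total_sentinels: int, forced: bool) -> int:
    # Assumption: the old master keeps running on the other side of the partition.
    masters = 1
    if not client_side.has_old_master and failover_allowed(client_side, total_sentinels, forced):
        masters += 1  # a replica on the client's side gets promoted as well
    return masters


# Client stuck with 1 of 3 sentinels, old master on the other side.
minority = Partition(sentinels=1, has_old_master=False)
without_force = masters_after_partition(minority, total_sentinels=3, forced=False)
with_force = masters_after_partition(minority, total_sentinels=3, forced=True)
```

Under these assumptions without_force is 1 and with_force is 2: the regular Sentinel workflow leaves the minority side without a master, while the forced failover produces two masters at once.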
Coming from a Java background, I also haven't seen such a feature in popular Redis clients such as Jedis and Lettuce.
This got me wondering about the following questions:
Are there strong reasons for this feature to be enabled by default? (I understand that you can effectively disable it by setting a huge value for it.) Are they really worth the risk of interfering with the natural Sentinel workflow and potentially introducing problems like the one I described above?
Will the library work fine with this option disabled? Are there cases I might have missed where turning this feature off leads to problems even on happy paths (no network partition, just regular failovers because of a deployment or a sudden node failure)?