
I have a two-node Redis cluster set up.

[master] 192.168.56.102: Redis Master (:6379), Redis Slave (:6380), Sentinel(:26379), Sentinel#2(:26380)

[rescue] 192.168.56.103: Redis Master (:6379), Redis Slave (:6380), Sentinel(:26379)

Each slave instance is a slave of the master instance on the same machine. Each sentinel instance monitors both master instances.

I am using the above in conjunction with twemproxy (which has nothing to do with this question) and a client-reconfig-script that updates the twemproxy configuration so that the application keeps working.
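For context, Sentinel invokes a client-reconfig-script with seven positional arguments describing the switch. A minimal sketch (hypothetical; my real script rewrites the twemproxy config) that just reports the switch looks like this:

```shell
#!/bin/sh
# Minimal sketch of a client-reconfig-script (hypothetical; the real script
# rewrites the twemproxy config). Sentinel invokes the script with:
#   <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
report_switch() {
    # $4/$5 are the old (failing) master, $6/$7 the newly promoted one
    echo "Failing master: $4:$5"
    echo "New Master: $6:$7"
}

report_switch "$@"
```

The script must be executable and exit quickly; Sentinel retries it on non-zero exit.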

I am stopping instances to see what happens and whether everything works correctly.

[master] stop Redis master: the sentinels reach quorum and elect a new master successfully. Log below.

==> /tmp/sentinel.log <==
[14701] 29 Dec 18:16:55.096 # +sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:16:55.096 # +sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14701] 29 Dec 18:18:04.187 # -sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:18:04.236 # -sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:18:14.160 * +convert-to-slave slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:47:47.151 # +sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14701] 29 Dec 18:47:47.170 # +sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14701] 29 Dec 18:47:57.650 * +reboot slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:47:57.652 * +reboot slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:47:57.715 # -sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14701] 29 Dec 18:47:57.738 # -sdown slave 192.168.56.102:6379 192.168.56.102 6379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:48:08.088 # +sdown master master 192.168.56.102 6380
[14701] 29 Dec 18:48:08.180 # +sdown master master 192.168.56.102 6380
[14701] 29 Dec 18:48:08.280 # +odown master master 192.168.56.102 6380 #quorum 2/2
[14701] 29 Dec 18:48:08.280 # +new-epoch 73
[14701] 29 Dec 18:48:08.280 # +try-failover master master 192.168.56.102 6380
[14701] 29 Dec 18:48:08.471 # +vote-for-leader a664b9f61df2b10bbbb5d865b01c599ddd36183c 73
[14701] 29 Dec 18:48:08.472 # 192.168.56.103:26379 voted for 491b32b95c547a8266faf9b04ce6b0c18486236b 73
[14705] 29 Dec 18:48:08.473 # +new-epoch 73
[14705] 29 Dec 18:48:08.475 # +vote-for-leader 491b32b95c547a8266faf9b04ce6b0c18486236b 73
[14701] 29 Dec 18:48:08.475 # 192.168.56.102:26380 voted for 491b32b95c547a8266faf9b04ce6b0c18486236b 73
[14701] 29 Dec 18:48:08.835 # +config-update-from sentinel 192.168.56.103:26379 192.168.56.103 26379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:48:08.835 # +config-update-from sentinel 192.168.56.103:26379 192.168.56.103 26379 @ master 192.168.56.102 6380
[14705] 29 Dec 18:48:08.836 # +switch-master master 192.168.56.102 6380 192.168.56.102 6379
[14701] 29 Dec 18:48:08.836 # +switch-master master 192.168.56.102 6380 192.168.56.102 6379
[14701] 29 Dec 18:48:08.836 * +slave slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14705] 29 Dec 18:48:08.836 * +slave slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379

==> /tmp/sent.log <==
Failing master: 192.168.56.102:6380
New Master: 192.168.56.102:6379

Failing master: 192.168.56.102:6380
New Master: 192.168.56.102:6379


==> /tmp/sentinel.log <==
[14705] 29 Dec 18:48:11.855 # +sdown slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14701] 29 Dec 18:48:11.903 # +sdown slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379

You can also see that my reconfigure script is called successfully and prints some details to the screen.

The problem appears when I try to stop the master instance on the [rescue] machine: the sentinels continuously report "-failover-abort-not-elected master resque 192.168.56.103 6379".

==> /tmp/sentinel.log <==
[14705] 29 Dec 18:48:11.855 # +sdown slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14701] 29 Dec 18:48:11.903 # +sdown slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14705] 29 Dec 18:48:43.401 # -sdown slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14701] 29 Dec 18:48:43.433 # -sdown slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14705] 29 Dec 18:48:53.344 * +convert-to-slave slave 192.168.56.102:6380 192.168.56.102 6380 @ master 192.168.56.102 6379
[14705] 29 Dec 18:49:23.617 # +sdown master resque 192.168.56.103 6379
[14701] 29 Dec 18:49:23.625 # +sdown master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:23.674 # +odown master resque 192.168.56.103 6379 #quorum 2/2
[14705] 29 Dec 18:49:23.674 # +new-epoch 74
[14705] 29 Dec 18:49:23.674 # +try-failover master resque 192.168.56.103 6379
[14701] 29 Dec 18:49:23.727 # +odown master resque 192.168.56.103 6379 #quorum 3/2
[14701] 29 Dec 18:49:23.727 # +new-epoch 74
[14701] 29 Dec 18:49:23.727 # +try-failover master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:23.886 # +vote-for-leader b608fcab7a201799826f4d9ee839aed3cf556fdf 74
[14701] 29 Dec 18:49:23.889 # +vote-for-leader a664b9f61df2b10bbbb5d865b01c599ddd36183c 74
[14701] 29 Dec 18:49:23.890 # 192.168.56.102:26380 voted for b608fcab7a201799826f4d9ee839aed3cf556fdf 74
[14705] 29 Dec 18:49:23.890 # 192.168.56.102:26379 voted for a664b9f61df2b10bbbb5d865b01c599ddd36183c 74
[14705] 29 Dec 18:49:23.893 # 192.168.56.103:26379 voted for b608fcab7a201799826f4d9ee839aed3cf556fdf 74
[14701] 29 Dec 18:49:23.893 # 192.168.56.103:26379 voted for b608fcab7a201799826f4d9ee839aed3cf556fdf 74
[14705] 29 Dec 18:49:23.980 # +elected-leader master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:23.980 # +failover-state-select-slave master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:24.064 # -failover-abort-no-good-slave master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:24.117 # Next failover delay: I will not start a failover before Mon Dec 29 18:49:32 2014
[14701] 29 Dec 18:49:28.417 # -failover-abort-not-elected master resque 192.168.56.103 6379
[14701] 29 Dec 18:49:28.489 # Next failover delay: I will not start a failover before Mon Dec 29 18:49:32 2014
[14705] 29 Dec 18:49:32.217 # +new-epoch 75
[14705] 29 Dec 18:49:32.217 # +try-failover master resque 192.168.56.103 6379
[14701] 29 Dec 18:49:32.423 # +new-epoch 75
[14701] 29 Dec 18:49:32.424 # +try-failover master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:32.433 # +vote-for-leader b608fcab7a201799826f4d9ee839aed3cf556fdf 75
[14705] 29 Dec 18:49:32.435 # 192.168.56.103:26379 voted for 491b32b95c547a8266faf9b04ce6b0c18486236b 75
[14701] 29 Dec 18:49:32.437 # +vote-for-leader a664b9f61df2b10bbbb5d865b01c599ddd36183c 75
[14701] 29 Dec 18:49:32.438 # 192.168.56.102:26380 voted for b608fcab7a201799826f4d9ee839aed3cf556fdf 75
[14705] 29 Dec 18:49:32.438 # 192.168.56.102:26379 voted for a664b9f61df2b10bbbb5d865b01c599ddd36183c 75
[14701] 29 Dec 18:49:32.438 # 192.168.56.103:26379 voted for 491b32b95c547a8266faf9b04ce6b0c18486236b 75
[14705] 29 Dec 18:49:36.636 # -failover-abort-not-elected master resque 192.168.56.103 6379
[14705] 29 Dec 18:49:36.691 # Next failover delay: I will not start a failover before Mon Dec 29 18:49:40 2014
[14701] 29 Dec 18:49:36.843 # -failover-abort-not-elected master resque 192.168.56.103 6379
[14701] 29 Dec 18:49:36.905 # Next failover delay: I will not start a failover before Mon Dec 29 18:49:40 2014

A new master is not elected and the reconfigure script is not called.

I understand that for a new master to be elected, a majority of the sentinels (n/2 + 1) has to agree on a leader. This is why I am using 3 sentinels for this test.
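As a quick sanity check on the arithmetic (a trivial sketch, not part of my setup) — note that the quorum argument to `sentinel monitor` only controls +odown detection, while the leader election always needs a strict majority of all known sentinels:

```shell
# Majority of sentinels required to authorize a failover.
majority() {
    echo $(( $1 / 2 + 1 ))
}

majority 3   # with 3 sentinels, 2 must vote for the same leader
```

So with my 3 sentinels, any 2 agreeing on a leader should be enough.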

I don't understand why the example above does not end in an election (unlike the [master] case).

I am using Redis server v=2.8.19 sha=0a21368c:1 malloc=jemalloc-3.6.0 bits=64 build=e570b291804f6e35

Thanks for any help and please don't mind the gross misspelling of the word rescue!

--edit

[master] redis slave config:

daemonize yes
pidfile "/var/run/redis/redis-server-slave.pid"
port 6380
tcp-backlog 511
bind 192.168.56.102
timeout 0
tcp-keepalive 0
loglevel notice
logfile "/var/log/redis/redis-server.log"
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "/var/lib/redis"
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
maxclients 4064
# Generated by CONFIG REWRITE

slaveof 192.168.56.102 6379

[master] master instance run_id: 5c1ffb7742ad78cde12dbe4858747a314adaebe9

[master] slave instance run_id: 5772f437519bb38782d38f6675ae5d9157be2419

[rescue] master instance run_id: 5c1ffb7742ad78cde12dbe4858747a314adaebe9

[rescue] slave instance run_id: 5772f437519bb38782d38f6675ae5d9157be2419

--

[master] Sentinel (:26379) run_id: a664b9f61df2b10bbbb5d865b01c599ddd36183c

[master] Sentinel (:26380) run_id: b608fcab7a201799826f4d9ee839aed3cf556fdf

[slave] Sentinel (:26379) run_id: a664b9f61df2b10bbbb5d865b01c599ddd36183c

Sentinel information

# Sentinel
sentinel_masters:2
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=resque,status=ok,address=192.168.56.103:6379,slaves=0,sentinels=3
master1:name=master,status=ok,address=192.168.56.102:6379,slaves=1,sentinels=3

Sentinel conf of 192.168.56.103

port 26379
logfile "/tmp/sentinel.log"
dir "/tmp"

sentinel monitor resque 192.168.56.103 6379 2
sentinel down-after-milliseconds resque 3000
sentinel failover-timeout resque 4000
sentinel client-reconfig-script resque /home/sm0ke/Projects/git/thrace/extra/sentinel-failover.py

sentinel config-epoch resque 0
sentinel leader-epoch resque 76
sentinel known-sentinel resque 192.168.56.102 26379 a664b9f61df2b10bbbb5d865b01c599ddd36183c
sentinel known-sentinel resque 192.168.56.102 26380 b608fcab7a201799826f4d9ee839aed3cf556fdf

maxclients 4064
daemonize yes
# Generated by CONFIG REWRITE
sentinel monitor master 192.168.56.102 6379 2
sentinel down-after-milliseconds master 3000
sentinel failover-timeout master 4000
sentinel client-reconfig-script master /home/sm0ke/Projects/git/thrace/extra/sentinel-failover.py
sentinel config-epoch master 73
sentinel leader-epoch master 73
sentinel known-slave master 192.168.56.102 6380
sentinel known-sentinel master 192.168.56.102 26380 b608fcab7a201799826f4d9ee839aed3cf556fdf
sentinel known-sentinel master 192.168.56.102 26379 a664b9f61df2b10bbbb5d865b01c599ddd36183c
sentinel current-epoch 76

redis-cli info for Sentinel on 192.168.56.103

# Server
redis_version:2.8.19
redis_git_sha1:0a21368c
redis_git_dirty:1
redis_build_id:e570b291804f6e35
redis_mode:sentinel
os:Linux 3.2.0-4-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:8345
run_id:491b32b95c547a8266faf9b04ce6b0c18486236b
tcp_port:26379
uptime_in_seconds:8054
uptime_in_days:0
hz:17
lru_clock:10594345
config_file:/etc/redis/sentinel.conf

# Sentinel
sentinel_masters:2
sentinel_tilt:0
sentinel_running_scripts:16
sentinel_scripts_queue_length:4
master0:name=resque,status=ok,address=192.168.56.103:6379,slaves=0,sentinels=3
master1:name=master,status=ok,address=192.168.56.102:6379,slaves=1,sentinels=3
sm0ke21
  • What's the config look like for the 6380 instance - particularly, is there a `slave-priority` set? Also, what are all of these instances - there are votes for `491b32b`, `a664b9`, and `b608fc` - check a sentinel's config file for the `resque` instance for what listeners those correspond to, something's funky if there are votes for 3 different instances. The `failover-abort-no-good-slave` isn't concerning, that's expected when the majority voted for the instance that just failed - the question is why those sentinels think they should be voting for it. – Shane Madden Dec 29 '14 at 18:21
  • Both slave instances have a slave-priority of 100. I have updated my question with the [master] redis slave configuration. – sm0ke21 Dec 29 '14 at 18:36
  • Is that the slave on the `master` node or the slave on the `rescue` node? And can you check what those instances IDs are? – Shane Madden Dec 29 '14 at 18:38
  • It is the slave of the master node. As far as instance IDs are concerned, I don't know what those are. I am googling it right now. – sm0ke21 Dec 29 '14 at 18:42
  • I have updated my question with some more information (run_ids). These seem to be the same for both masters and both slaves respectively. Is this a problem? I could not find out what instance IDs are, Shane. – sm0ke21 Dec 29 '14 at 18:47
  • Yes, that could be part of the problem if multiple nodes are in the same master group. Can you clarify what your replication topology is supposed to look like here - is the `rescue` node getting data from the master node? I'm confused why you're providing the `192.168.56.102:6380` config when the problem you're having is on the `192.168.56.103` system? Can you provide what's in the `sentinel.conf`, and connect `redis-cli` to the sentinel port and provide the output of `sentinel masters` / `sentinel slaves resque`? – Shane Madden Dec 29 '14 at 19:00
  • Please excuse me, I am getting a bit confused myself. I have updated my question with the information you requested. Moreover, my target topology is a cluster with 2 nodes. Each node runs a master and a slave Redis instance. The slave instances get data from the masters on the same machine. I know this doesn't make sense, but I am first trying to debug this problem. In the end each slave will replicate the master of the **other** node. – sm0ke21 Dec 29 '14 at 19:17

2 Answers


Ignore me on the instance ID confusion, that's my fault - I forgot that the sentinels were voting for a leader among the sentinels, not voting for a new master among the candidates, so the three different IDs makes sense.

So, here's the real problem:

master0:name=resque,status=ok,address=192.168.56.103:6379,slaves=0,sentinels=3

There are no slaves there that the sentinels have noticed, so they don't have a good node to promote after the vote.

Check the slaveof config on the 192.168.56.103:6380 instance, make sure that instance is running, and connect to it to check its INFO output to verify that it is replicating. Once 192.168.56.103:6379 reports it as a slave (in its INFO output), the sentinels will pick it up as a known slave and be able to fail over to it.
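As a concrete (hypothetical) example of what to look for in that output — the interesting fields are `role`, `master_host`/`master_port`, and `master_link_status`:

```shell
# Hypothetical excerpt of `redis-cli -h 192.168.56.103 -p 6380 info replication`.
# This sample deliberately shows a misconfiguration: the slave is replicating
# 192.168.56.102 instead of its local master 192.168.56.103.
sample='role:slave
master_host:192.168.56.102
master_port:6379
master_link_status:up'

master_host=$(printf '%s\n' "$sample" | awk -F: '/^master_host:/ {print $2}')
if [ "$master_host" != "192.168.56.103" ]; then
    echo "slaving the wrong master: $master_host"
fi
```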

Shane Madden
  • I saw that too. It seems (for some reason that is beyond me) that the [rescue] slave was a slave of the [master] master redis instance. I have corrected that by issuing a slaveof 192.168.56.103 6379 and now it seems that the failover is done successfully. – sm0ke21 Dec 29 '14 at 19:46
  • @sm0ke21 Great! – Shane Madden Dec 29 '14 at 19:47
  • Hello. I'm having the same problem, but in my case I have `master0:name=redis-master,status=ok,address=172.29.245.6:6379,slaves=2,sentinels=3` and I'm still unable to fail over. I still get `+vote-for-leader 61c51a37f2db1673ddcf5dc3fe6816c9ba83a408 42 -failover-abort-not-elected master redis-master 172.29.245.6 6379` – Jason Stanley Oct 27 '17 at 15:37

In my case this happened because I had set up Redis with authentication, so I added the parameter below to sentinel.conf to handle intra-cluster authentication:

sentinel auth-pass prodcluster MyStrongPassword
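For completeness, the Redis instances themselves need matching settings so that replication and Sentinel checks keep working (same assumed password everywhere, shown for illustration):

```
# redis.conf on every instance (masters and slaves):
requirepass MyStrongPassword
masterauth MyStrongPassword

# sentinel.conf, so the sentinels can talk to the authenticated instances:
sentinel auth-pass prodcluster MyStrongPassword
```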
Mansur Ul Hasan