We have a big redis node in a cluster, and we're resharding it to other nodes. While doing that, we encountered many times the error reported below, which forced us to manually kill the instance using kill -9 and reload it.
In normal operation the node has been working for months without any problems. The only special thing about it is that there are some very big keys to migrate; for those usually we expected a timeout in the migration, but previously we didn't have this error which forces us to restart the node.
resharding log:
Moving slot 7929 from <source-node>:<port> to <target-node>:<port>: .................$
Moving slot 7930 from <source-node>:<port> to <target-node>:<port>: .................$
Moving slot 7931 from <source-node>:<port> to <target-node>:<port>: .................$
Moving slot 7932 from <source-node>:<port> to <target-node>:<port>: .................$
[ERR] Calling MIGRATE: Connection timed out
redis log:
=== REDIS BUG REPORT START: Cut & paste starting from here ===
29395:M 31 Jan 22:16:03.248 # Redis 3.2.3 crashed by signal: 11
29395:M 31 Jan 22:16:03.248 # Crashed running the instuction at: 0x468d4d
29395:M 31 Jan 22:16:03.248 # Accessing address: (nil)
29395:M 31 Jan 22:16:03.248 # Failed assertion: <no assertion failed> (<no file>:0)
------ STACK TRACE ------
EIP:
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](migrateCloseSocket+0x5d)[0x468d4d]
Backtrace:
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](logStackTrace+0x29)[0x45e699]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](sigsegvHandler+0xaa)[0x45ebca]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7ffec32ce340]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](migrateCloseSocket+0x5d)[0x468d4d]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](migrateCommand+0x6d6)[0x469526]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](call+0x85)[0x426fb5]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](processCommand+0x367)[0x42a0e7]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](processInputBuffer+0x105)[0x436d05]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](aeProcessEvents+0x218)[0x421428]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](aeMain+0x2b)[0x4216db]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster](main+0x410)[0x41e690]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7ffec2f19ec5]
/home/cybranding/redis-3.2.3/redis-stable/src/redis-server *:7010 [cluster][0x41e902]
------ INFO OUTPUT ------
# Server
redis_version:3.2.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:4992f89db2d932d
redis_mode:cluster
os:Linux 3.13.0-37-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.2
process_id:29395
run_id:ea2b51c12beba3394b49febc23e62ddda85ddf7d
tcp_port:7010
uptime_in_seconds:218901
uptime_in_days:2
hz:10
lru_clock:9505952
executable:/home/cybranding/redis-3.2.3/redis-stable/src/redis-server
config_file:/etc/redis_cluster_client2/redis-3.2.3/7010/redis.conf
# Clients
connected_clients:7
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:252229967880
used_memory_human:234.91G
used_memory_rss:226196897792
used_memory_rss_human:210.66G
used_memory_peak:261549868192
used_memory_peak_human:243.59G
total_system_memory:405738954752
total_system_memory_human:377.87G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:0.90
mem_allocator:jemalloc-4.0.3
# Persistence
loading:0
rdb_changes_since_last_save:2441620072
rdb_bgsave_in_progress:0
rdb_last_save_time:1485682059
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:78198652011
aof_base_size:71141120009
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:50546
total_commands_processed:56092302
instantaneous_ops_per_sec:470
total_net_input_bytes:8820578118
total_net_output_bytes:15465464075
instantaneous_input_kbps:65.53
instantaneous_output_kbps:39.50
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:70160
evicted_keys:0
keyspace_hits:25360054
keyspace_misses:16707970
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:1
# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:17440.44
used_cpu_user:20364.37
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Commandstats
cmdstat_get:calls=3013601,usec=69399477,usec_per_call=23.03
cmdstat_setex:calls=93465,usec=29887340,usec_per_call=319.77
cmdstat_incr:calls=6164797,usec=48002289,usec_per_call=7.79
cmdstat_sadd:calls=6404,usec=171138,usec_per_call=26.72
cmdstat_sismember:calls=1920066,usec=57777586,usec_per_call=30.09
cmdstat_zincrby:calls=29150890,usec=1841876323,usec_per_call=63.18
cmdstat_zrevrange:calls=298356,usec=100612284,usec_per_call=337.22
cmdstat_zscore:calls=2641113,usec=185788620,usec_per_call=70.34
cmdstat_expire:calls=10839539,usec=70516036,usec_per_call=6.51
cmdstat_ping:calls=69,usec=114,usec_per_call=1.65
cmdstat_info:calls=50357,usec=8779396,usec_per_call=174.34
cmdstat_cluster:calls=1034173,usec=262678023,usec_per_call=254.00
cmdstat_migrate:calls=879469,usec=6675930023,usec_per_call=7590.86
cmdstat_command:calls=3,usec=10340,usec_per_call=3446.67
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=141374638,expires=30371,avg_ttl=45048480
hash_init_value: 1485278005
------ CLIENT LIST OUTPUT ------
id=50534 addr=127.0.0.1:43431 fd=10 name= age=828 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=zincrby
id=50535 addr=127.0.0.1:43432 fd=11 name= age=828 idle=7 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sismember
id=50541 addr=127.0.0.1:58608 fd=16 name= age=346 idle=49 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=get
id=50108 addr=127.0.0.1:55292 fd=18 name= age=39041 idle=6 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=zrevrange
id=50546 addr=127.0.0.1:50654 fd=19 name= age=108 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=migrate
id=50543 addr=127.0.0.1:60246 fd=13 name= age=264 idle=264 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=info
id=50544 addr=127.0.0.1:54748 fd=17 name= age=165 idle=159 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=cluster
------ CURRENT CLIENT INFO ------
id=50546 addr=127.0.0.1:50654 fd=19 name= age=108 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=migrate
argv[0]: 'DEL'
argv[1]: 'nht.tw.urls.dec:쩐다'
------ REGISTERS ------
29395:M 31 Jan 22:16:03.283 #
RAX:00007fcc4d3f30e3 RBX:b8e0a2b8e097b8e0
RCX:0000000000000002 RDX:0000000000000001
RDI:00007fcc4d3f30f9 RSI:00000000004e7857
RBP:00007fdf418f6430 RSP:00007fff08e86490
R8 :0000000000000005 R9 :00007ffec2a0d5a0
R10:0000000000000002 R11:00007ffec2c00180
R12:0000000000000001 R13:00007fc0e9df1d40
R14:0000000000000002 R15:0000000000000000
RIP:0000000000468d4d EFL:0000000000010202
CSGSFS:0000000000000033
29395:M 31 Jan 22:16:03.283 # (00007fff08e8649f) -> 0000000000000058
29395:M 31 Jan 22:16:03.283 # (00007fff08e8649e) -> 0000000004af8cbe
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649d) -> 0000000000000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649c) -> 00007fc07adec0f0
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649b) -> 0000000000000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e8649a) -> 00007ffec39f5690
29395:M 31 Jan 22:16:03.284 # (00007fff08e86499) -> 0000000100000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e86498) -> 00007fc000000002
29395:M 31 Jan 22:16:03.284 # (00007fff08e86497) -> 00007fc0ee22c610
29395:M 31 Jan 22:16:03.284 # (00007fff08e86496) -> 00007fec6ad43fa0
29395:M 31 Jan 22:16:03.284 # (00007fff08e86495) -> 00007fc00000000a
29395:M 31 Jan 22:16:03.284 # (00007fff08e86494) -> 00007fc000000000
29395:M 31 Jan 22:16:03.284 # (00007fff08e86493) -> 0000000000469526
29395:M 31 Jan 22:16:03.284 # (00007fff08e86492) -> 0000000000000001
29395:M 31 Jan 22:16:03.284 # (00007fff08e86491) -> 00007fc07adec0f0
29395:M 31 Jan 22:16:03.284 # (00007fff08e86490) -> 00007fc0e9df1d40
------ FAST MEMORY TEST ------
29395:M 31 Jan 22:16:03.284 # Bio thread for job type #0 terminated
29395:M 31 Jan 22:16:03.285 # Bio thread for job type #1 terminated
*** Preparing to test memory region 724000 (94208 bytes)
*** Preparing to test memory region 1907000 (135168 bytes)
*** Preparing to test memory region 7fc053400000 (268131368960 bytes)
*** Preparing to test memory region 7ffec13ff000 (8388608 bytes)
*** Preparing to test memory region 7ffec1c00000 (14680064 bytes)
*** Preparing to test memory region 7ffec2a00000 (4194304 bytes)
*** Preparing to test memory region 7ffec32b9000 (20480 bytes)
*** Preparing to test memory region 7ffec34d8000 (16384 bytes)
*** Preparing to test memory region 7ffec39f5000 (16384 bytes)
*** Preparing to test memory region 7ffec3a01000 (4096 bytes)
*** Preparing to test memory region 7ffec3a02000 (8192 bytes)
*** Preparing to test memory region 7ffec3a06000 (4096 bytes)