relinquishing primary when one node down in three nodes mongo ReplicaSet

Question

I have a ReplicaSet with three nodes, 192.168.254.107:27023, 192.168.254.108:27023, 192.168.2.69:27023.

When 192.168.254.108:27023 is down (unexpected shutdown), then 192.168.2.69:27023 becomes primary.
But after a while, relinquishing primary, transition to SECONDARY. And the election and relinquishing keep looping execution.

The ReplicaSet is a shard of a cluster.

Here is the log:

403174 2016-07-20T09:41:32.054+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403175 2016-07-20T09:41:32.225+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403176 2016-07-20T09:41:32.226+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403177 2016-07-20T09:41:32.226+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403178 2016-07-20T09:41:34.218+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
403179 2016-07-20T09:41:34.218+0800 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
403180 2016-07-20T09:41:34.218+0800 I REPL     [replExecDBWorker-1] transition to SECONDARY
403181 2016-07-20T09:41:34.222+0800 I REPL     [ReplicationExecutor] Member 192.168.254.107:27023 is now in state SECONDARY
403182 2016-07-20T09:41:34.226+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403183 2016-07-20T09:41:34.227+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403184 2016-07-20T09:41:34.228+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403185 2016-07-20T09:41:39.228+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403186 2016-07-20T09:41:39.229+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403187 2016-07-20T09:41:39.230+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403188 2016-07-20T09:41:42.057+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403189 2016-07-20T09:41:42.057+0800 W NETWORK  [ReplicaSetMonitorWatcher] No primary detected for set shard1
403190 2016-07-20T09:41:43.066+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [192.168.2.69:20001]
403191 2016-07-20T09:41:43.066+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [192.168.254.108:20001]
403192 2016-07-20T09:41:43.067+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [192.168.254.107:20001]
403193 2016-07-20T09:41:43.283+0800 I SHARDING [LockPinger] cluster 192.168.2.69:20001,192.168.254.108:20001,192.168.254.107:20001 pinged successfully at 2016-07-20T09:41:43.068+0800 by distributed lock pinger '192.168.2.69:20001,192.168.254.108:20001,192.168.254.107:20001/bdc9:27023:1467257950:-742828636', sleeping for 30000ms
403194 2016-07-20T09:41:44.230+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403195 2016-07-20T09:41:44.231+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403196 2016-07-20T09:41:44.232+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403197 2016-07-20T09:41:45.057+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
403198 2016-07-20T09:41:45.057+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
403199 2016-07-20T09:41:45.058+0800 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from 192.168.254.108:27023: HostUnreachable: Connection ref       used
403200 2016-07-20T09:41:45.058+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
403201 2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 5040
403202 2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
403203 2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403204 2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403205 2016-07-20T09:41:45.060+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403206 2016-07-20T09:41:45.645+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
……
……
……
403217 2016-07-20T09:41:49.061+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403218 2016-07-20T09:41:49.062+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403219 2016-07-20T09:41:49.063+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403220 2016-07-20T09:41:51.064+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403221 2016-07-20T09:41:51.064+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403222 2016-07-20T09:41:51.065+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403223 2016-07-20T09:41:52.058+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403224 2016-07-20T09:41:53.065+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403225 2016-07-20T09:41:53.066+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403226 2016-07-20T09:41:53.067+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403227 2016-07-20T09:41:55.059+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
403228 2016-07-20T09:41:55.059+0800 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
403229 2016-07-20T09:41:55.059+0800 I REPL     [replExecDBWorker-0] transition to SECONDARY
403230 2016-07-20T09:41:55.064+0800 I REPL     [ReplicationExecutor] Member 192.168.254.107:27023 is now in state SECONDARY
403231 2016-07-20T09:41:55.068+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403232 2016-07-20T09:41:55.068+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403233 2016-07-20T09:41:55.069+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403234 2016-07-20T09:42:00.070+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403235 2016-07-20T09:42:00.071+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403236 2016-07-20T09:42:00.071+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403237 2016-07-20T09:42:02.061+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403238 2016-07-20T09:42:02.061+0800 W NETWORK  [ReplicaSetMonitorWatcher] No primary detected for set shard1
403239 2016-07-20T09:42:05.071+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403240 2016-07-20T09:42:05.072+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403241 2016-07-20T09:42:05.073+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403242 2016-07-20T09:42:06.304+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
403243 2016-07-20T09:42:06.304+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
403244 2016-07-20T09:42:06.305+0800 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from 192.168.254.108:27023: HostUnreachable: Connection refused
403245 2016-07-20T09:42:06.305+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
403246 2016-07-20T09:42:06.306+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 5041
403247 2016-07-20T09:42:06.306+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
403248 2016-07-20T09:42:06.306+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403249 2016-07-20T09:42:06.307+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403250 2016-07-20T09:42:06.307+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403251 2016-07-20T09:42:06.647+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted

Here is the shard status:

shard1:SECONDARY> rs.status()
{
    "set" : "shard1",
    "date" : ISODate("2016-07-20T03:20:49.001Z"),
    "myState" : 2,
    "term" : NumberLong(5326),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.254.107:27023",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 8,
            "optime" : {
                "ts" : Timestamp(1468879844, 1),
                "t" : NumberLong(267)
            },
            "optimeDate" : ISODate("2016-07-18T22:10:44Z"),
            "lastHeartbeat" : ISODate("2016-07-20T03:20:45.947Z"),
            "lastHeartbeatRecv" : ISODate("2016-07-05T08:38:08.083Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 1
        },
        {
            "_id" : 1,
            "name" : "192.168.254.108:27023",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : {
                "ts" : Timestamp(0, 0),
                "t" : NumberLong(-1)
            },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2016-07-20T03:20:45.952Z"),
            "lastHeartbeatRecv" : ISODate("2016-07-18T22:10:46.930Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "Connection refused",
            "configVersion" : -1
        },
        {
            "_id" : 2,
            "name" : "192.168.2.69:27023",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 1727777,
            "optime" : {
                "ts" : Timestamp(1468984831, 1),
                "t" : NumberLong(5326)
            },
            "optimeDate" : ISODate("2016-07-20T03:20:31Z"),
            "infoMessage" : "could not find member to sync from",
            "configVersion" : 1,
            "self" : true
        }
    ],
    "ok" : 1,
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("7fffffff00000000000014ce")
    }
}

How does it happen?

Here is the configuration of the ReplicaSet, when it is ok:

shard1:PRIMARY> rs.conf()
{
    "_id" : "shard1",
    "version" : 1,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "192.168.254.107:27023",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "192.168.254.108:27023",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "192.168.2.69:27023",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("577493542ff2d62af240fe4f")
    }
}

Pay attention to the "can't see a majority of the set, relinquishing primary. Stepping down from primary in response to heartbeat" log messages. For some reason ,probably because of the slow network, the node can not reach the majority of nodes in the replica set and becomes a secondary. When the majority established then election is started again. Configuration of heartbeat/election settings may help, refer the documentation docs.mongodb.com/manual/reference/replica-configuration — Andriy Simonov, Jul 20 '16 at 10:16
@AndriySimonov I guess it is too, can I increase the heartbeat time to solve this problem? — Yanhui Zhou, Jul 21 '16 at 02:31

relinquishing primary when one node down in three nodes mongo ReplicaSet

0 Answers0