MongoDB Primary fails to come back

Question

I have a three-member MongoDB replica set running on Windows. When the primary server (S1) goes down, the secondary is elected primary as expected. When the original primary comes back up, it stays in an invalid state:

     {
            "state" : 10,
            "stateStr" : "REMOVED",
            "uptime" : 111,
            "optime" : Timestamp(1448462710, 6),
            "optimeDate" : ISODate("2015-11-25T14:45:10Z"),
            "ok" : 0,
            "errmsg" : "Our replica set config is invalid or we are not a member of it",
            "code" : 93
     }

After that, the secondary keeps switching between primary and secondary every few seconds, making my application unstable.

The only way to bring the primary server back is to run rs.reconfig(c).

I couldn't find anything wrong with the config files.

Any help will be appreciated.

UPDATE: Here's the current config:

{
    "_id" : "companyName",
    "version" : 32593,
    "protocolVersion" : NumberLong(1),
    "members" : [
            {
                    "_id" : 1,
                    "host" : "arb.companyName.com:40000",
                    "arbiterOnly" : true,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 1,
                    "tags" : {

                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            },
            {
                    "_id" : 2,
                    "host" : "m3.companyName.com:40000",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 11,
                    "tags" : {

                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            },
            {
                    "_id" : 4,
                    "host" : "m2.companyName.com:40000",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 3,
                    "tags" : {

                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            }
    ],
    "settings" : {
            "chainingAllowed" : true,
            "heartbeatIntervalMillis" : 2000,
            "heartbeatTimeoutSecs" : 10,
            "electionTimeoutMillis" : 10000,
            "getLastErrorModes" : {

            },
            "getLastErrorDefaults" : {
                    "w" : 1,
                    "wtimeout" : 0
            },
            "replicaSetId" : ObjectId("573dfcd0e8ae6154ff80c50d")
    }
}

Should I be using IP addresses rather than host names?

UPDATE 2:

This is the log for the primary (m3.companyName.com - IP 1.1.1.1), from when it was rebooted until I logged in to the other server (m2.companyName.com - IP 2.2.2.2) and ran rs.reconfig() manually.

2016-09-06T07:42:05.953Z I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-09-06T07:42:05.953Z I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'c:/mongossl/data3/diagnostic.data'
2016-09-06T07:42:05.954Z I NETWORK  [initandlisten] waiting for connections on port 40000 ssl
2016-09-06T07:42:05.955Z W NETWORK  [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.955Z I NETWORK  [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.957Z W NETWORK  [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.957Z I NETWORK  [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.958Z W NETWORK  [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.959Z I NETWORK  [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.959Z W REPL     [ReplicationExecutor] Locally stored replica set configuration does not have a valid entry for the current node; waiting for reconfig or remote heartbeat; Got "NodeNotFound: No host described in new configuration 32592 for replica set companyName2 maps to this node" while validating { _id: "companyName2", version: 32592, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] New replica set config in use: { _id: "companyName2", version: 32592, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] This node is not a member of the config
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] transition to REMOVED
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] Starting replication applier threads
2016-09-06T07:42:06.651Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53746 #1 (1 connection now open)
2016-09-06T07:42:06.760Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53747 #2 (2 connections now open)
2016-09-06T07:42:06.864Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53748 #3 (3 connections now open)
2016-09-06T07:42:06.993Z I ACCESS   [conn1]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.067Z I ACCESS   [conn2]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.159Z I ACCESS   [conn3]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.552Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:07.627Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:08.975Z I NETWORK  [conn1] end connection 2.2.2.2:53746 (2 connections now open)
2016-09-06T07:42:08.975Z I NETWORK  [conn2] end connection 2.2.2.2:53747 (2 connections now open)
2016-09-06T07:42:08.975Z I NETWORK  [conn3] end connection 2.2.2.2:53748 (2 connections now open)
2016-09-06T07:42:09.371Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53763 #4 (1 connection now open)
2016-09-06T07:42:09.639Z I ACCESS   [conn4]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:13.059Z I NETWORK  [initandlisten] connection accepted from 3.3.3.3:58220 #5 (2 connections now open)
2016-09-06T07:42:13.127Z I ACCESS   [conn5]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=arb.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:13.292Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to arb.companyName.com:40000
2016-09-06T07:42:13.301Z I REPL     [ReplicationExecutor] Member arb.companyName.com:40000 is now in state ARBITER
2016-09-06T07:42:13.974Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53765 #6 (3 connections now open)
2016-09-06T07:42:14.433Z I ACCESS   [conn6] Successfully authenticated as principal appUser on companyName
2016-09-06T07:42:16.629Z I NETWORK  [initandlisten] connection accepted from 1.1.1.13:49162 #7 (4 connections now open)
2016-09-06T07:42:16.853Z I ACCESS   [conn7] Successfully authenticated as principal appUser on companyName
2016-09-06T07:42:17.703Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:17.703Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:18.131Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:18.206Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:23.369Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53767 #8 (5 connections now open)
2016-09-06T07:42:23.832Z I ACCESS   [conn8] Successfully authenticated as principal sa on admin
2016-09-06T07:42:28.356Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:38.431Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:38.431Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:38.861Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:38.936Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:49.086Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:59.161Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:59.161Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:59.590Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:59.665Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:09.814Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:19.889Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:43:19.889Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:43:20.317Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:43:20.392Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:30.542Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:34.054Z I NETWORK  [initandlisten] connection accepted from 1.1.1.13:49188 #9 (6 connections now open)
2016-09-06T07:43:34.106Z I ACCESS   [conn9] Successfully authenticated as principal sa on admin
2016-09-06T07:43:40.617Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:43:40.617Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:43:41.045Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:43:41.120Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:51.270Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:51.277Z I NETWORK  [initandlisten] connection accepted from 1.1.1.13:49193 #10 (7 connections now open)
2016-09-06T07:43:51.339Z I ACCESS   [conn10] Successfully authenticated as principal sa on admin
2016-09-06T07:44:01.346Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:01.346Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:01.775Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:01.850Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:12.001Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:44:22.077Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:22.077Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:22.506Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:22.582Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:32.732Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:44:42.807Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:42.807Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:43.237Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:43.312Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:53.462Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:03.537Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:03.537Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:03.966Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:04.041Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:14.191Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:24.266Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:24.266Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:24.700Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:24.775Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:34.925Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:45.000Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:45.000Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:45.428Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:45.504Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:55.654Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:05.729Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:46:05.729Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:46:06.157Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:06.232Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:46:16.382Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:26.458Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:46:26.458Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:46:26.889Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:26.964Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:46:37.115Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:43.185Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53847 #11 (8 connections now open)
2016-09-06T07:46:43.392Z I ACCESS   [conn11]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:46:43.541Z I NETWORK  [conn11] end connection 2.2.2.2:53847 (7 connections now open)
2016-09-06T07:46:44.370Z I NETWORK  [initandlisten] connection accepted from 3.3.3.3:58224 #12 (8 connections now open)
2016-09-06T07:46:44.434Z I ACCESS   [conn12]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=arb.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:46:44.451Z I NETWORK  [conn12] end connection 3.3.3.3:58224 (7 connections now open)
2016-09-06T07:46:47.832Z I REPL     [ReplicationExecutor] New replica set config in use: { _id: "companyName2", version: 32593, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:46:47.832Z I REPL     [ReplicationExecutor] This node is m3.companyName.com:40000 in the config
2016-09-06T07:46:47.832Z I REPL     [ReplicationExecutor] transition to STARTUP2
2016-09-06T07:46:47.907Z I REPL     [ReplicationExecutor] Scheduling priority takeover at 2016-09-06T03:46:57.907-0400
2016-09-06T07:46:48.040Z I REPL     [ReplicationExecutor] syncing from: m2.companyName.com:40000
2016-09-06T07:46:48.545Z I REPL     [SyncSourceFeedback] setting syncSourceFeedback to m2.companyName.com:40000
2016-09-06T07:46:48.977Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:50.983Z I REPL     [ReplicationExecutor] transition to RECOVERING
2016-09-06T07:46:50.985Z I REPL     [ReplicationExecutor] transition to SECONDARY
2016-09-06T07:46:51.438Z I REPL     [ReplicationExecutor] could not find member to sync from
2016-09-06T07:46:57.907Z I REPL     [ReplicationExecutor] Canceling priority takeover callback
2016-09-06T07:46:57.907Z I REPL     [ReplicationExecutor] Starting an election for a priority takeover
2016-09-06T07:46:57.907Z I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-06T07:46:57.916Z I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-09-06T07:46:57.925Z I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 244
2016-09-06T07:46:57.925Z I REPL     [ReplicationExecutor] transition to PRIMARY
2016-09-06T07:46:58.345Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:58.362Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:58.440Z I REPL     [rsSync] transition to primary complete; database writes are now permitted

The most obvious thing I noticed is the "No such host is known" error. Maybe Mongo is trying to start before Windows can resolve the names?
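To put a number on the flapping visible in the log above, here is a minimal Python sketch that pulls the "is now in state" lines for one member and prints the time between consecutive state changes. The function names (`state_changes`, `gaps_in_seconds`) are made up for illustration, and the regular expression assumes the 3.2-style log format shown in this post.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2016-09-06T07:42:28.356Z I REPL [ReplicationExecutor] Member host:port is now in state PRIMARY
STATE_LINE = re.compile(
    r"^(\S+)Z\s+I\s+REPL\s+\[ReplicationExecutor\]\s+Member (\S+) is now in state (\w+)"
)

def state_changes(lines, member):
    """Return (timestamp, state) pairs for one member, in log order."""
    changes = []
    for line in lines:
        m = STATE_LINE.match(line)
        if m and m.group(2) == member:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            changes.append((ts, m.group(3)))
    return changes

def gaps_in_seconds(changes):
    """Seconds between consecutive state changes."""
    return [
        round((later[0] - earlier[0]).total_seconds(), 3)
        for earlier, later in zip(changes, changes[1:])
    ]

# Three lines copied from the log above
sample = [
    "2016-09-06T07:42:28.356Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY",
    "2016-09-06T07:42:38.936Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY",
    "2016-09-06T07:42:49.086Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY",
]
changes = state_changes(sample, "m2.companyName.com:40000")
print(gaps_in_seconds(changes))  # [10.58, 10.15]
```

On these lines the gaps come out at roughly 10 seconds each. That is the same order of magnitude as the electionTimeoutMillis (10000) and heartbeatTimeoutSecs (10) values in the config, but whether those settings drive the cycle is an assumption on my part, not something I have confirmed.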

Can you describe how the three members are configured? You refer to one of them as "the primary" and another as "the secondary", but then you say that the config files are basically identical? – Vince Bowdren, Sep 06 '16 at 08:25
Please run rs.conf() on the primary and paste the result here. That error message indicates that your machine's hostname or IP address changed during the "downtime". – JJussi, Sep 06 '16 at 16:44
@CodeCaster please ignore that comment. I meant to say that the config files on each server are identical except for the server-specific information (db path and SSL cert path). – perfect_element, Sep 07 '16 at 05:18
A few things are strange (one of them may be the solution). The error message's timestamp, "optimeDate" : ISODate("2015-11-25T14:45:10Z"), is from last year, and replication doesn't work if the clock is set to the wrong time. Also, your rs.conf() output contains "replicaSetId" : ObjectId("573dfcd0e8ae6154ff80c50d"); a normal config doesn't have such a value. – JJussi, Sep 07 '16 at 05:38
@JJussi the optimeDate is not from my actual error message, I got it from another file because by the time I was writing this post I just had a screenshot of the error. All other information is accurate though. I'll investigate the replicaSetId. I don't know where that's coming from. – perfect_element, Sep 07 '16 at 05:51
Yes, you are right! Mongo is starting "too early". What happens if you restart the mongod process (after DNS is working) while the other machine is in that ping-pong state? Or can you delay the startup of mongod? Maybe turn off autostart and start mongod manually once the machine and network are ready. That replicaSetId carries the timestamp Thu, 19 May 2016 17:50:08 GMT, if that helps you find out where it is coming from. – JJussi, Sep 07 '16 at 15:26
And of course, if you use IP addresses in the config, the machine doesn't need DNS and it will probably work. Or you can add those addresses to the hosts file (under the /windows/system32/drivers/etc/ directory). – JJussi, Sep 07 '16 at 15:32
I confirmed that this was the problem. When I changed the MongoDB Windows service from Automatic to Automatic (Delayed Start), everything worked as expected. – perfect_element, Sep 10 '16 at 08:39
I added my suggestion as an answer, so you can mark it as the right answer. Thank you! – JJussi, Sep 15 '16 at 06:04
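
One of the comments above reads a creation time out of the replicaSetId. For anyone who wants to check that, the first four bytes of an ObjectId are a big-endian count of seconds since the Unix epoch, so the value can be decoded with plain Python. This is a small sketch; `objectid_timestamp` is a name I made up, not a driver function.

```python
from datetime import datetime, timezone

def objectid_timestamp(hex_id):
    """Decode the creation time stored in the first 4 bytes of an ObjectId."""
    if len(hex_id) != 24:
        raise ValueError("an ObjectId is 24 hex characters")
    seconds = int(hex_id[:8], 16)  # big-endian seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# The replicaSetId from the config posted in the question
print(objectid_timestamp("573dfcd0e8ae6154ff80c50d").isoformat())
# 2016-05-19T17:50:08+00:00
```

The result matches the Thu, 19 May 2016 17:50:08 GMT quoted in the comment.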

score 2 · Accepted Answer · answered Sep 15 '16 at 06:04

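Delay the startup of mongod until the machine can resolve hostnames. As the log in the question shows, mongod started before DNS was available, could not resolve any of the hostnames in the replica set config (including its own), and therefore transitioned to REMOVED. On Windows, changing the MongoDB service from Automatic to Automatic (Delayed Start) resolved the issue.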

JJussi

I am having the same issue myself and I am curious why delaying the startup of Mongo will work? What is the reason for the current issue? – siddhu Jun 25 '19 at 20:01
Delay it until what happens? – David Resnick May 02 '22 at 14:56
If you read the whole thread, you can see that the mongod process started before the DNS service. Delayed autostart fixed the problem. – JJussi May 04 '22 at 06:43
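
For setups where a delayed service start is not an option, the same idea ("do not start mongod until names resolve") can be sketched as a small pre-start check in Python. This is only an illustration under assumptions: `unresolved_hosts` and `wait_for_dns` are hypothetical helper names, the timeout and interval values are arbitrary, and how you hook the check into your own startup script is up to you.

```python
import socket
import time

def unresolved_hosts(hosts, resolve=socket.getaddrinfo):
    """Return the entries of `hosts` that do not resolve right now."""
    missing = []
    for host in hosts:
        try:
            resolve(host, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing

def wait_for_dns(hosts, timeout=120.0, interval=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Block until every host resolves; return False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        if not unresolved_hosts(hosts, resolve):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)

# Intended use (not executed here, because it performs real DNS lookups):
#   members = ["arb.companyName.com", "m3.companyName.com", "m2.companyName.com"]
#   if wait_for_dns(members):
#       ...start mongod...
```

The `resolve` and `sleep` parameters exist only so the check can be exercised with a stand-in resolver instead of real DNS.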

score 0 · Answer 2 · answered Dec 01 '17 at 19:25

I had the same problem when I tried to replace a secondary with a server restored from a backup. The problem was that I started the mongod process on the backup server before it was reachable from the replica set (before switching from the old server to the new, restored one). Restarting the mongod process solved the problem.

My suggestion is to start the mongod process only once it is reachable from the replica set it should belong to.
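
A rough way to check connectivity before starting mongod is a plain TCP probe of the other members, sketched below in Python. Note the limits: it only tests the direction from this machine to the others (not whether they can reach this machine), it does not perform a TLS handshake, and `reachable` and `unreachable_members` are hypothetical helper names.

```python
import socket

def reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False

def unreachable_members(members, timeout=3.0):
    """Return the (host, port) pairs from `members` that cannot be reached."""
    return [m for m in members if not reachable(m[0], m[1], timeout)]

# Intended use (not executed here, because it opens real connections):
#   others = [("m2.companyName.com", 40000), ("arb.companyName.com", 40000)]
#   if not unreachable_members(others):
#       ...start mongod...
```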
