
I have 2 servers (productive and bi), each running 2 mongod services, all four of them in one replica set. Today the VLAN went down, so the servers can no longer see each other. The problem is that MongoDB didn't elect any member to become primary, and I don't know how to force one of them to become primary.

I've tried restarting the server with no success, and also reconfiguring the replica set to change priorities, but that has to be done from a primary node, and I can't connect to a primary... I'm really stuck...
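
Roughly what I tried from mongosh, connected to one of the reachable members (just a sketch to show the idea; the member order matches the rs.status() output below):

// Connected to a reachable SECONDARY with mongosh.
// Goal: raise the priority of the local members so one of them wins the election.
cfg = rs.conf()
cfg.members[0].priority = 100   // productive.vlan.local:27017
cfg.members[1].priority = 10    // productive.vlan.local:37017
rs.reconfig(cfg)                // rejected: a plain reconfig must be run on a PRIMARY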

I've also read this question https://stackoverflow.com/a/59851668/513570 but I'm not sure I understand it well. I expected that when there is a problem with the primary, one of the secondary nodes would be elected primary, but of course that didn't happen. How can I configure the 4 nodes so that it does?

So 2 questions here: how can I force one of the secondary members to become primary? And how do I configure a replica set so it always has a primary online? Any help will be really appreciated.

rs.status():

{
  set: 'repset',
  date: ISODate("2022-04-07T08:22:14.569Z"),
  myState: 2,
  term: Long("6"),
  syncSourceHost: '',
  syncSourceId: -1,
  heartbeatIntervalMillis: Long("2000"),
  majorityVoteCount: 3,
  writeMajorityCount: 3,
  votingMembersCount: 4,
  writableVotingMembersCount: 4,
  optimes: {
    lastCommittedOpTime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
    lastCommittedWallTime: ISODate("1970-01-01T00:00:00.000Z"),
    appliedOpTime: { ts: Timestamp({ t: 1649214911, i: 2 }), t: Long("6") },
    durableOpTime: { ts: Timestamp({ t: 1649214911, i: 2 }), t: Long("6") },
    lastAppliedWallTime: ISODate("2022-04-06T03:15:11.013Z"),
    lastDurableWallTime: ISODate("2022-04-06T03:15:11.013Z")
  },
  lastStableRecoveryTimestamp: Timestamp({ t: 1649214911, i: 2 }),
  members: [
    {
      _id: 0,
      name: 'productive.vlan.local:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 418,
      optime: { ts: Timestamp({ t: 1649214911, i: 2 }), t: Long("6") },
      optimeDate: ISODate("2022-04-06T03:15:11.000Z"),
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 13,
      configTerm: 6,
      self: true,
      lastHeartbeatMessage: ''
    },
    {
      _id: 1,
      name: 'productive.vlan.local:37017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 410,
      optime: { ts: Timestamp({ t: 1649214911, i: 2 }), t: Long("6") },
      optimeDurable: { ts: Timestamp({ t: 1649214911, i: 2 }), t: Long("6") },
      optimeDate: ISODate("2022-04-06T03:15:11.000Z"),
      optimeDurableDate: ISODate("2022-04-06T03:15:11.000Z"),
      lastHeartbeat: ISODate("2022-04-07T08:22:14.565Z"),
      lastHeartbeatRecv: ISODate("2022-04-07T08:22:14.305Z"),
      pingMs: Long("0"),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 13,
      configTerm: 6
    },
    {
      _id: 2,
      name: 'bi.vlan.local:37017',
      health: 0,
      state: 8,
      stateStr: '(not reachable/healthy)',
      uptime: 0,
      optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
      optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
      lastHeartbeat: ISODate("2022-04-07T08:22:09.867Z"),
      lastHeartbeatRecv: ISODate("1970-01-01T00:00:00.000Z"),
      pingMs: Long("0"),
      lastHeartbeatMessage: 'Error connecting to bi.vlan.local:37017 (10.0.130.209:37017) :: caused by :: No route to host',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: -1,
      configTerm: -1
    },
    {
      _id: 3,
      name: 'bi.vlan.local:47017',
      health: 0,
      state: 8,
      stateStr: '(not reachable/healthy)',
      uptime: 0,
      optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
      optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
      lastHeartbeat: ISODate("2022-04-07T08:22:09.867Z"),
      lastHeartbeatRecv: ISODate("1970-01-01T00:00:00.000Z"),
      pingMs: Long("0"),
      lastHeartbeatMessage: 'Error connecting to bi.vlan.local:47017 (10.0.130.209:47017) :: caused by :: No route to host',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: -1,
      configTerm: -1
    }
  ],
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1649214911, i: 2 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1649214911, i: 2 })
}
  • A ReplicaSet can elect the primary only when the **majority** of all members are reachable. 2 out of 4 is not the majority! How did you try to reconfigure the replica set? Did you set `force: true`? – Wernfried Domscheit Apr 07 '22 at 09:13
  • Ok, I hadn't tried `force: true`. In the end I opened the ports via external IP and then a primary could be elected again. I've also changed the priorities of each member so the local members have higher priority... So from your response I understand I have 2 solutions: 1) add one local member, 2) remove one remote member... Is there any other option, like giving higher weight to local members in the election so that 2/4 makes a majority? – Miquel Apr 07 '22 at 09:41
  • Not really clear what you mean, perhaps have a look at this: https://stackoverflow.com/questions/69658590/mongoclient-to-connect-to-multiple-hosts-to-handle-failover/69666511#69666511 – Wernfried Domscheit Apr 07 '22 at 11:04
  • Wow, this is a very good explanation... I've been reading this: https://www.mongodb.com/docs/manual/core/replica-set-arbiter/ In that case, I think I should add an arbiter on the production server; that way there would always be a primary available in production in case of a network failure, is that correct? If so, I would add a comment to your answer noting that arbiters can change the voting process... – Miquel Apr 07 '22 at 16:40
  • Yes, an arbiter is typically useful when you have an even number of replica set members. In the best case, the arbiter is hosted at a different location in your network. Note that the arbiter does not store any data and does almost nothing; a tiny machine is sufficient for it. – Wernfried Domscheit Apr 07 '22 at 17:19

1 Answer


Thanks to @Wernfried-Domscheit's comment, the problem is that an election cannot take place: only 2 of the 4 voting members are reachable, and that is not a majority (see this question).

As the bi server only serves backups and BI operations, I ended up changing the votes and priority of the members on the bi server:

  members: [
    {
      _id: 0,
      host: 'productive.vlan.local:27017',
      priority: 100,
      votes: 1
    },
    {
      _id: 1,
      host: 'productive.vlan.local:37017',
      priority: 10,
      votes: 1
    },
    {
      _id: 2,
      host: 'bi.vlan.local:37017',
      priority: 0,
      votes: 0
    },
    {
      _id: 3,
      host: 'bi.vlan.local:47017',
      priority: 0,
      votes: 0
    }
  ],
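
To apply these settings while there is no primary, a forced reconfiguration from one of the reachable members should work. A minimal sketch from mongosh, assuming you are connected to productive.vlan.local:27017 and that the member order matches the config above:

// Forced reconfig from a reachable SECONDARY (no PRIMARY available).
cfg = rs.conf()
cfg.members[0].priority = 100   // productive.vlan.local:27017
cfg.members[1].priority = 10    // productive.vlan.local:37017
cfg.members[2].priority = 0     // bi.vlan.local:37017
cfg.members[2].votes = 0        // a non-voting member must also have priority 0
cfg.members[3].priority = 0     // bi.vlan.local:47017
cfg.members[3].votes = 0
rs.reconfig(cfg, { force: true })

With only the two productive members voting, they form the majority (2 of 2) on their own, so a primary can be elected even while the bi server is unreachable.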

Another option would be to add an arbiter, but in this situation it is not required at all...
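
For completeness, if an arbiter were chosen instead, it would be started as a normal mongod with --replSet repset (it stores no data, so a tiny machine is enough) and then added from the primary. A sketch, where arbiter.vlan.local:27017 is just a placeholder host:

// Run on the PRIMARY; the arbiter host below is hypothetical.
rs.addArb("arbiter.vlan.local:27017")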

  • To my mind, this is not really fault tolerant. You need at least three different server locations (and a fifth node if you want to survive the crash of two nodes). – godo57 Apr 08 '22 at 11:53
  • @mcanzerini Not really sure what you mean: with this configuration, if one or both of the `bi` members go down, the service stays up and working, as the majority is still there. If I'm right, the problem would be if one of the two members in `productive` goes down; in that case the service will go down. Is that correct? – Miquel Apr 08 '22 at 16:11