
I have a replica set in which I am trying to replace the primary with a machine that has more memory and larger disks. So I RAIDed a couple of disks together on the new machine, rsync'd the data over from a secondary, and added it to the replica set. After checking rs.status(), I noticed that all the secondaries are about 12 hours behind the primary, so when I try to force the new server into the primary spot it won't work, because it is not up to date.

This seems like a big issue, because if the primary were to fail we would be at least 12 hours behind, and in some cases almost 48 hours behind.

The oplogs all overlap and the oplog size is fairly large. The only thing I can figure is that I am performing a lot of reads and writes on the primary, which could be keeping the server locked and preventing the secondaries from catching up properly.
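
For reference, the oplog window and per-secondary lag can be checked from the mongo shell with the standard replication helpers (a quick sketch, not specific to this setup):

// On the primary: oplog size and the time range it currently covers
db.printReplicationInfo()

// On the primary: how far behind each secondary's optime is
db.printSlaveReplicationInfo()

// Or query the oplog directly for its oldest and newest entries
use local
db.oplog.rs.find().sort({$natural: 1}).limit(1)   // oldest entry
db.oplog.rs.find().sort({$natural: -1}).limit(1)  // newest entry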

Is there a way to force a secondary to catch up to the primary?

There are currently 5 servers; the last 2 were added to replace 2 of the existing nodes. The node with _id 6 is the one intended to replace the primary. The node that is furthest behind the primary's optime is a little over 48 hours behind. Here is the rs.status() output:

{
"set" : "gryffindor",
"date" : ISODate("2011-05-12T19:34:57Z"),
"myState" : 2,
"members" : [
    {
        "_id" : 1,
        "name" : "10******:27018",
        "health" : 1,
        "state" : 2,
        "stateStr" : "SECONDARY",
        "uptime" : 20231,
        "optime" : {
            "t" : 1305057514000,
            "i" : 31
        },
        "optimeDate" : ISODate("2011-05-10T19:58:34Z"),
        "lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
    },
    {
        "_id" : 2,
        "name" : "10******:27018",
        "health" : 1,
        "state" : 2,
        "stateStr" : "SECONDARY",
        "uptime" : 20231,
        "optime" : {
            "t" : 1305056009000,
            "i" : 400
        },
        "optimeDate" : ISODate("2011-05-10T19:33:29Z"),
        "lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
    },
    {
        "_id" : 3,
        "name" : "10******:27018",
        "health" : 1,
        "state" : 1,
        "stateStr" : "PRIMARY",
        "uptime" : 20229,
        "optime" : {
            "t" : 1305228858000,
            "i" : 422
        },
        "optimeDate" : ISODate("2011-05-12T19:34:18Z"),
        "lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
    },
    {
        "_id" : 5,
        "name" : "10*******:27018",
        "health" : 1,
        "state" : 2,
        "stateStr" : "SECONDARY",
        "uptime" : 20231,
        "optime" : {
            "t" : 1305058009000,
            "i" : 226
        },
        "optimeDate" : ISODate("2011-05-10T20:06:49Z"),
        "lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
    },
    {
        "_id" : 6,
        "name" : "10*******:27018",
        "health" : 1,
        "state" : 2,
        "stateStr" : "SECONDARY",
        "optime" : {
            "t" : 1305050495000,
            "i" : 384
        },
        "optimeDate" : ISODate("2011-05-10T18:01:35Z"),
        "self" : true
    }
],
"ok" : 1
}
Bryan

2 Answers


After looking through everything, I found a single error, which led me back to a mapreduce that had been run on the primary and hit this issue: https://jira.mongodb.org/browse/SERVER-2861. So when replication was attempted, it failed to sync because of a faulty/corrupt operation in the oplog.
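
If you want to confirm something like this yourself, one way (just a sketch) is to look at the tail of the primary's oplog around the point where the secondary stalled:

// Run on the primary, in the mongo shell
use local

// Most recent operations in the oplog, newest first; the op from the
// offending mapreduce should appear here (or in the secondary's mongod
// log as the operation it repeatedly fails to apply).
db.oplog.rs.find().sort({ $natural: -1 }).limit(10)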

Bryan

To answer the original question (though it wouldn't have solved the OP's problem): I believe the best way to force a secondary to "catch up" is to remove it from the set and re-add it, as sketched below, but chances are (as in this case) there are other problems. Check your logs.
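
A rough sketch of that remove/re-add cycle from the mongo shell (the hostname below is a placeholder; clear the member's data directory while it is down if you want it to perform a full initial sync when it rejoins):

// Run against the PRIMARY
rs.remove("10.x.x.x:27018")   // drop the lagging member from the set config

// ...stop mongod on that member and wipe its dbpath (done on the member itself)...

rs.add("10.x.x.x:27018")      // add it back; it will initial-sync from another member
rs.status()                   // watch it move through RECOVERING to SECONDARY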

gWaldo