
We have set up a three-member replica set consisting of the following, all on MongoDB version 3.4:

  • Primary. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
  • Secondary. Amazon EC2, Windows Server 2016, m4.2xlarge, 32 GB RAM, 8 vCPUs. Hosted in Germany.
  • Arbiter. Tiny cloud-based Linux instance.

The problem we are seeing is that the secondary is unable to keep up with the primary. After we seed it with data (copied over from the primary) and add it to the replica set, it typically manages to get in sync, but an hour later it might lag behind by 10 minutes; a few hours later it's an hour behind, and so on, until, a day or two later, it goes stale.
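
For reference, this is roughly how we have been tracking the lag (standard 3.4 shell helpers, run against the primary; nothing here is specific to our setup):

    // Summary of how far each member is behind its sync source:
    rs.printSlaveReplicationInfo()

    // Or compute the lag explicitly from rs.status():
    var s = rs.status();
    var pri = s.members.filter(function (m) { return m.stateStr === "PRIMARY";   })[0];
    var sec = s.members.filter(function (m) { return m.stateStr === "SECONDARY"; })[0];
    print("lag (seconds): " + (pri.optimeDate - sec.optimeDate) / 1000);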

We are trying to figure out why this is. The primary is consistently at 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU. This seems to be the only potential resource constraint; disk and network load do not seem to be an issue. There also seems to be some locking going on on the secondary, as operations in the mongo shell (such as db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works (it just says i/o timeout). Here is output from mongostat during a rare instance when it reported stats for the secondary:

                   host insert query update delete getmore command dirty  used flushes vsize   res qrw  arw net_in net_out conn set repl                time
        localhost:27017     *0    33    743     *0       0   166|0  1.0% 78.7%       0 27.9G 27.0G 0|0  0|1  2.33m    337k  739  rs  PRI Mar 27 14:41:54.578
  primary.XXX.com:27017     *0    36    825     *0       0   131|0  1.0% 78.7%       0 27.9G 27.0G 0|0  0|0  1.73m    322k  739  rs  PRI Mar 27 14:41:53.614
secondary.XXX.com:27017     *0    *0     *0     *0       0   109|0  4.3% 80.0%       0 8.69G 7.54G 0|0 0|10  6.69k    134k  592  rs  SEC Mar 27 14:41:53.673
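
Since mongostat usually times out against the secondary, the numbers behind its dirty / used / qrw / arw columns can also be read directly on the box with db.serverStatus(); a minimal sketch using the standard globalLock and wiredTiger.cache sections:

    var ss = db.serverStatus();

    // qrw / arw: queued and active readers|writers
    printjson(ss.globalLock.currentQueue);    // { total: ..., readers: ..., writers: ... }
    printjson(ss.globalLock.activeClients);

    // dirty / used: WiredTiger cache utilisation
    var c = ss.wiredTiger.cache;
    var max = c["maximum bytes configured"];
    print("used:  " + (100 * c["bytes currently in the cache"]     / max).toFixed(1) + " %");
    print("dirty: " + (100 * c["tracked dirty bytes in the cache"] / max).toFixed(1) + " %");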

I ran db.serverStatus() on the secondary and compared it to the primary, and one number stood out:

"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")

The secondary had an uptime of 14,000 seconds at the time, so readers had accumulated roughly 21,000 seconds of global lock acquisition wait (summed across threads) in under four hours of uptime.
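
In case the arithmetic is useful, this is how we pulled and compared those two numbers (a sketch; the counter is accumulated across all threads, so it can exceed wall-clock time):

    var ss = db.serverStatus();
    var waitSecs = ss.locks.Global.timeAcquiringMicros.r / 1e6;  // ~21188 s in our case
    var upSecs   = ss.uptime;                                    // ~14000 s in our case
    print("global 'r' lock wait per second of uptime: " + (waitSecs / upSecs).toFixed(2));
    // ~1.5 here, i.e. on average more than one thread was stuck waiting for the
    // global read lock at any given moment.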

Would appreciate any ideas on what this could be, or how to debug this issue! We could upgrade the Amazon instance to something beefier, but we've done that three times already, and at this point we figure that something else must be wrong.

I'll include output from db.currentOp() on the secondary below, in case it helps. (That command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)

"desc":"conn605",
"connectionId":605,"client":"127.0.0.1:61098",
"appName":"MongoDB Shell",
"secs_running":0,
"microsecs_running":NumberLong(16),
"op":"command",
"ns":"admin.$cmd",
"query":{"currentOp":1},
"locks":{},
"waitingForLock":false,
"lockStats":{}

"desc":"repl writer worker 10",
"secs_running":0,
"microsecs_running":NumberLong(14046),
"op":"none",
"ns":"CustomerDB.ed2112ec779f",
"locks":{"Global":"W","Database":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong(1)}}}

"desc":"ApplyBatchFinalizerForJournal",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}

"desc":"ReplBatcher",
"secs_running":11545,
"microsecs_running":NumberLong("11545663961"),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog":{"acquireCount":{"r":NumberLong(1)}}}

"desc":"rsBackgroundSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545281690"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong(2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}

"desc":"TTLMonitor",
"op":"none",
"ns":"",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(341534123)}},"Database":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}}

"desc":"SyncSourceFeedback",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}

"desc":"WT RecordStoreThread: local.oplog.rs",
"secs_running":1163,
"microsecs_running":NumberLong(1163137036),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong(1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}}

"desc":"rsSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545663926"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{"Global":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564)},"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}

"desc":"clientcursormon",
"secs_running":0,
"microsecs_running":NumberLong(15649),
"op":"none",
"ns":"CustomerDB.b72ac80177ef",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(397538606)}},"Database":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}],"ok":1}
  • What kind of [writeConcern](https://docs.mongodb.com/manual/reference/write-concern/) and [readConcern](https://docs.mongodb.com/manual/reference/read-concern/) do your queries usually have? – Vince Bowdren Mar 27 '17 at 16:18
  • From your "used" (cache used %) value (~80%), I can see that mongo could use more memory; your active data with indexes is probably bigger than the available memory. That's not (that) bad if you have very low-latency disks (Amazon EBS SSD), but mongo is VERY sensitive to disk latency and too-low IOPS. @Vince Bowdren: Write/ReadConcern don't have an effect on replication lag. – JJussi Mar 28 '17 at 04:43

1 Answer


JJussi was exactly right (thanks!). The problem was that the active data was bigger than the available memory, and we were using Amazon EBS "Throughput Optimized HDD" volumes. We changed them to "General Purpose SSD" and the problem immediately went away. We were even able to downgrade the server from m4.2xlarge to m4.large.
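
For anyone hitting the same symptoms, this is roughly the sanity check behind that diagnosis (a sketch; listDatabases reports compressed on-disk size, so the real working set with indexes is typically larger still):

    // Compare total data size on disk with the WiredTiger cache size on the secondary:
    var dbs = db.adminCommand({ listDatabases: 1 });
    var cacheMax = db.serverStatus().wiredTiger.cache["maximum bytes configured"];

    print("data on disk: " + (dbs.totalSize / Math.pow(1024, 3)).toFixed(1) + " GB");
    print("WT cache max: " + (cacheMax      / Math.pow(1024, 3)).toFixed(1) + " GB");
    // If the hot data + indexes are well above the cache size, applying replicated writes
    // on the secondary tends to turn into random disk reads, and the EBS volume type
    // starts to dominate.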

We were confused by the fact that this manifested as high CPU load. We had assumed disk load was not an issue, based on the fairly low amount of data written to disk per second. But when we tried making the AWS server the primary, we noticed a very strong correlation between high CPU load and disk queue length. Further testing showed that the volume performed very poorly under the kind of I/O that MongoDB generates.

  • You probably had a lot of time spent in iowait, which isn't the same thing as normal CPU use. – D. SM Dec 02 '20 at 00:17