0

I am using MongoDB version 3.0.14 (with wiredtiger) as 4 member repliate set. I am facing a strange issue in production where suddenly most of the queries start to block on Global.timeAcquiringMicros.r (on Secondary Mongod Server, where the java-client is sending normal read queries using SecondaryPreferred Read Preference)

2020-06-09T12:30:23.959+0000 I COMMAND  [conn210676] query <db>.<collection> query: { $query: { _id: { $gte: "<value>", $lt: "<value>" } }, $orderby: { _id: -1 }, $maxTimeMS: 16000 } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:11 nscannedObjects:11 keyUpdates:0 writeConflicts:0 numYields:6 nreturned:11 reslen:270641 locks:{ Global: { acquireCount: { r: 14 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 4580895 } }, Database: { acquireCount: { r: 7 } }, Collection: { acquireCount: { r: 7 } } } 1706ms

2020-06-09T12:30:25.887+0000 I COMMAND  [conn210607] query <db>.<collection> query: { $query: { _id: { $gte: "<value1>", $lt: "<value2>" } }, $orderby: { _id: -1 }, $maxTimeMS: 16000 } planSummary: IXSCAN { _id: 1 } cursorid:76676946055 ntoreturn:0 ntoskip:0 nscanned:40 nscannedObjects:40 keyUpdates:0 writeConflicts:0 numYields:12 nreturned:40 reslen:1062302 locks:{ Global: { acquireCount: { r: 26 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 21622755 } }, Database: { acquireCount: { r: 13 } }, Collection: { acquireCount: { r: 13 } } } 3639ms

These query are taking the intent shared lock (r) on the Global database and it has to wait for 4580895/21622755 microseconds (according to timeAcquiringMicros). I have the following queries:

  • What is happening that it has to take 4580895/21622755 microseconds? I can assure that there are no activities happening apart from normal delete/update/insert/query based on _id index
  • Since Global.timeAcquiringMicros.r is high, I am assuming there is any operation taking 'R', 'W' lock on Global database, then, what is the way to capture such queries? I have tried db.currentOp() but couldn't find anything.
  • Global.acquireCount=14, why the query has to take Global intent shared lock (r) 14 times?

Explain result of the 2nd query (whose Global.timeAcquiringMicros.r = 21622755):

{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "<db>.<coll>",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "_id" : {
                        "$lt" : "<stop>"
                    }
                },
                {
                    "_id" : {
                        "$gte" : "<start>"
                    }
                }
            ]
        },
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "_id" : 1
                },
                "indexName" : "_id_",
                "isMultiKey" : false,
                "direction" : "backward",
                "indexBounds" : {
                    "_id" : [
                        "(\"<stop>\", \"<start>\"]"
                    ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 69,
        "executionTimeMillis" : 57,
        "totalKeysExamined" : 69,
        "totalDocsExamined" : 69,
        "executionStages" : {
            "stage" : "FETCH",
            "nReturned" : 69,
            "executionTimeMillisEstimate" : 50,
            "works" : 70,
            "advanced" : 69,
            "needTime" : 0,
            "needFetch" : 0,
            "saveState" : 2,
            "restoreState" : 2,
            "isEOF" : 1,
            "invalidates" : 0,
            "docsExamined" : 69,
            "alreadyHasObj" : 0,
            "inputStage" : {
                "stage" : "IXSCAN",
                "nReturned" : 69,
                "executionTimeMillisEstimate" : 0,
                "works" : 70,
                "advanced" : 69,
                "needTime" : 0,
                "needFetch" : 0,
                "saveState" : 2,
                "restoreState" : 2,
                "isEOF" : 1,
                "invalidates" : 0,
                "keyPattern" : {
                    "_id" : 1
                },
                "indexName" : "_id_",
                "isMultiKey" : false,
                "direction" : "backward",
                "indexBounds" : {
                    "_id" : [
                        "(\"<stop>\", \"<start>\"]"
                    ]
                },
                "keysExamined" : 69,
                "dupsTested" : 0,
                "dupsDropped" : 0,
                "seenInvalidated" : 0,
                "matchTested" : 0
            }
        }
    },
    "serverInfo" : {
        "host" : "<host>",
        "port" : 27017,
        "version" : "3.0.14",
        "gitVersion" : "08352afcca24bfc145240a0fac9d28b978ab77f3"
    },
    "ok" : 1
}

I was able to capture one of the currentOp() when this kind of behaviour happens:

{
            "desc" : "conn225729",
            "threadId" : "0x1321c1400",
            "connectionId" : 225729,
            "opid" : 189970948,
            "active" : false,
            "op" : "getmore",
            "ns" : "<db>.<coll>",
            "query" : {

            },
            "client" : "<client-ip>:55596",
            "numYields" : 0,
            "locks" : {
                "Global" : "r"
            },
            "waitingForLock" : true,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(1)
                    },
                    "acquireWaitCount" : {
                        "r" : NumberLong(1)
                    },
                    "timeAcquiringMicros" : {
                        "r" : NumberLong(7500907)
                    }
                }
            }
}
  • Here also the issue is same, a normal query waiting for Global.timeAcquiringMicros.r ("waitingForLock" : true)

Please help and refer to some MongoDB docs which explain the issue. Also, let me know if any other logs are required.

Abhishek Kumar
  • 3,328
  • 2
  • 18
  • 31
  • What are explained queries with execution stats? https://docs.mongodb.com/manual/reference/explain-results/ – D. SM Jun 09 '20 at 16:34
  • @D.SM I have updated my question with the explain result – Abhishek Kumar Jun 10 '20 at 04:15
  • Execution stats are missing. – D. SM Jun 10 '20 at 06:10
  • @D.SM added the execution stats – Abhishek Kumar Jun 10 '20 at 06:44
  • Execution stats say the query completed in 57 ms. You are running a different query or your deployment is in a different state (load, etc.). – D. SM Jun 10 '20 at 07:02
  • I have run the query on the same deployment state (no of documents have increased when I ran the query now). Yes, the load will differ as this is a production setup. But, my query is more towards WiredTiger locks behavior, I don't see a reason for huge `Global.timeAcquiringMicros.r` – Abhishek Kumar Jun 10 '20 at 07:11

1 Answers1

2

A MongoDB secondary node retrieves operation log events from the primary in batches. When it applies a batch of oplog events, it takes a global exclusive write lock (W).

A read intent lock(r) is mutually exclusive with the W lock.

This means that writes and reads must interleave on the secondary nodes, so heavy writes can block reads, and heavy reads can delay replication.

Non-blocking secondary reads was a major feature of MongoDB 4.0 a couple of years ago.

If you can upgrade, that specific lock contention should not happen anymore.

Joe
  • 25,000
  • 3
  • 22
  • 44
  • Your answer makes complete sense. But we were facing a strange issue where one of Secondary (deployed on AWS server) was showing this delay behavior (and read/write requests were getting queued a lot). We stopped that server and everything is working fine. Thanks for suggesting to move to 4.0 so to have non-blocking read in Secondary. – Abhishek Kumar Jun 11 '20 at 08:12