Indexes are a vital part of good query performance, and this applies to aggregation queries as well.
In general, query performance depends on various factors: the amount of data, document size, indexes, how the query is built, and hardware considerations like processor, disk, RAM, and of course the network. Any other load on the server also affects your currently running query.
Slow query? First check whether any indexes are defined on the collection. Still slow? Generate a query plan using explain() with the "executionStats" verbosity mode. The query plan document shows the winning query plan and the execution statistics for that plan: the stages, whether an index is used (an IXSCAN stage) or a collection scan is performed (COLLSCAN), and, if an index is used, which one. The execution statistics show the number of documents returned, the execution time, the total index keys examined, the total documents examined, etc., at different stages.
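For example, for a find-based query the plan with execution statistics can be generated like this (using the time_series collection from later in this answer):

```javascript
db.time_series.find( { eid: { $exists: true } } ).explain("executionStats")
```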
In the question's scenario, the number of documents in the collection mattered (50+ million). An index built on a large number of keys will itself be large, and a large index can be difficult to load into memory. If the index cannot fit in memory, there will be disk access and the query will be slow.
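To get a sense of how large the indexes are (and whether they are likely to fit in memory), the collection statistics can be checked; the indexSizes field reports each index's size in bytes:

```javascript
// Collection statistics; indexSizes lists each index's size in bytes
db.time_series.stats().indexSizes
```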
The two queries: the first returns the count and the second the distinct values of the eid field:
db.time_series.count( { eid: { $exists: true } } );
db.time_series.distinct( "eid", { eid: { $exists: true } } );
I tried the above queries on a sample of similar data with 1.2 million documents, and ran the query plan with execution stats for both the count and the distinct queries. The execution statistics on my machine:
"nReturned" : 0,
"executionTimeMillis" : 778,
"totalKeysExamined" : 0,
"totalDocsExamined" : 1145477,
"nReturned" : 1144845,
"executionTimeMillis" : 775,
"totalKeysExamined" : 0,
"totalDocsExamined" : 1145477,
I created the partial index on the eid field, as in the question.
db.time_series.createIndex( { eid: 1 }, { partialFilterExpression: { eid: { $exists: true } } } )
db.time_series.getIndexes() shows the newly created index:
[
    ...
    {
        "v" : 2,
        "key" : {
            "eid" : 1
        },
        "name" : "eid_1",
        "ns" : "test.time_series",
        "partialFilterExpression" : {
            "eid" : {
                "$exists" : true
            }
        }
    }
]
The query plan showed that the index was used. The statistics:
"nReturned" : 0,
"executionTimeMillis" : 4278,
"totalKeysExamined" : 1144845,
"totalDocsExamined" : 1144845,
"nReturned" : 1144845,
"executionTimeMillis" : 4409,
"totalKeysExamined" : 1144845,
"totalDocsExamined" : 1144845,
The executionTimeMillis shown was much higher after the index was created. Also, note that the index was used (the query plan has an IXSCAN stage), but it turned out to be of no use: the number of index keys examined is high, and is the same as the number of documents returned.
Using Aggregation Queries:
Aggregation queries process data in stages, which gives you more control over the query and how it is processed. Aggregation queries can also use indexes, as find queries do, though a little differently: once the pipeline reaches certain stages, indexes are no longer used, so make sure an index-using stage comes first (sometimes the query optimizer does this for you). This can be verified in the query plan.
The aggregation pipelines for the distinct and count queries:
db.time_series.aggregate( [
{ $match: { eid: { $exists: true } } },
{ $group : { _id : "$eid" } },
{ $project: { eid: "$_id", _id: 0 } }
] )
db.time_series.aggregate( [
{ $match: { eid: { $exists: true } } },
{ $group : { _id : null, count: { $sum: 1 } } } ,
{ $project: { _id: 0 } }
] )
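As a side note, assuming MongoDB 3.4 or newer, the $count stage is available as a shorthand for the $group / $project combination in the count pipeline:

```javascript
db.time_series.aggregate( [
    { $match: { eid: { $exists: true } } },
    { $count: "count" }    // equivalent to the $group + $project above
] )
```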
In this case too, I tried with and without the index on the eid field. explain can be used on an aggregation query like this:
db.time_series.explain("executionStats").aggregate( [ ... ] )
The explain stats with and without the index:

With the index:

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 4813,
    "totalKeysExamined" : 1144845,
    "totalDocsExamined" : 1144845,

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 8322,
    "totalKeysExamined" : 1144845,
    "totalDocsExamined" : 1144845,

Without the index:

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 1043,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 1145477,

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 3884,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 1145477,
Note on Memory Constraints in Aggregation:
- The documents in the result set are subject to the 16 MB BSON
  document size limit. The limit doesn't apply to the document size
  as it flows through the pipeline (that is, in the intermediate
  stages). The resulting document size can be controlled using the
  $limit and $project stages.
- Each stage has a 100 MB RAM limit. The best way to stay within it
  is to use indexes with the larger stages (this is because the
  indexes are smaller than the documents they reference). As a last
  resort, use the { allowDiskUse: true } option with the aggregation
  query; expect a degradation in performance when this option is
  used, since disk access is much slower than memory and the stage's
  work is spilled to disk. This option is more often used in batch
  processing.
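For example, the count pipeline from above could be run with the option like this:

```javascript
db.time_series.aggregate(
    [
        { $match: { eid: { $exists: true } } },
        { $group: { _id: null, count: { $sum: 1 } } }
    ],
    { allowDiskUse: true }    // lets a stage spill to disk if it exceeds the 100 MB limit
)
```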
Conclusion:
Achieving the right query performance is mostly a science. There are quite a few tools for this, like explain / the query planner and indexes. A bit of trial and error with various options on sample documents is required too. There is also some math involved: the document size and the number of documents tell you how many bytes of memory the documents and the indexes use.