Indexes are a vital part of good query performance, and this applies to aggregation queries as well.
In general, query performance depends on various factors: the amount of data, document size, indexes, how the query is built, and hardware considerations like processor, disk, RAM, and of course the network. Any other load on the server also affects your currently running query.
Slow query? First check whether any indexes are defined on the collection. Still slow? Generate a query plan using explain() with the "executionStats" verbosity mode. The query plan document shows the winning query plan and the execution statistics for that plan: the stages, whether an index is used (an IXSCAN stage) or a collection scan is performed (COLLSCAN), and, if an index is used, which one. The execution statistics show the number of documents returned, the execution time, the total index keys examined, the total documents examined, etc., at different stages.
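For example, for a find-based query the plan with execution statistics can be generated like this (using the time_series collection from later in this answer):

```javascript
db.time_series.find( { eid: { $exists: true } } ).explain("executionStats")
```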
In the question's scenario, the number of documents in the collection mattered (50+ million). An index built on a large number of keys will itself be large, and a large index can be difficult to load into memory. If the index cannot fit in memory, there will be disk access and the query will be slow.
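To get a sense of how large the indexes are (and whether they are likely to fit in memory), the collection statistics can be checked; the indexSizes field reports each index's size in bytes:

```javascript
// Collection statistics; indexSizes lists each index's size in bytes
db.time_series.stats().indexSizes
```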
The two queries: the first returns the count and the second the distinct values of the eid field:
db.time_series.count( { eid: { $exists: true } } );
db.time_series.distinct( "eid", { eid: { $exists: true } } );
I tried the above queries on a sample of similar data with 1.2 million documents, and ran the query plan with execution stats for both the count and the distinct queries. The execution statistics on my machine:
"nReturned" : 0,
"executionTimeMillis" : 778,
"totalKeysExamined" : 0,
"totalDocsExamined" : 1145477,
"nReturned" : 1144845,
"executionTimeMillis" : 775,
"totalKeysExamined" : 0,
"totalDocsExamined" : 1145477,
I created the partial index on the eid field, as in the question.
db.time_series.createIndex( { eid: 1 }, { partialFilterExpression: { eid: { $exists: true } } } )
db.time_series.getIndexes() shows the newly created index:
[
    ...
    {
        "v" : 2,
        "key" : {
            "eid" : 1
        },
        "name" : "eid_1",
        "ns" : "test.time_series",
        "partialFilterExpression" : {
            "eid" : {
                "$exists" : true
            }
        }
    }
]
The query plan showed that the index was used. The statistics:
"nReturned" : 0,
"executionTimeMillis" : 4278,
"totalKeysExamined" : 1144845,
"totalDocsExamined" : 1144845,
"nReturned" : 1144845,
"executionTimeMillis" : 4409,
"totalKeysExamined" : 1144845,
"totalDocsExamined" : 1144845,
The executionTimeMillis shown was much higher after the index was created. Also, note that the index was used (the query plan has an IXSCAN stage), but it turned out to be of no use: the number of index keys examined is high, and is the same as the number of documents returned.
Using Aggregation Queries:
Aggregation queries process data in stages, which gives you more control over the query and how it is processed. Aggregation queries can also use indexes, as find queries do, though a little differently: once the pipeline reaches certain stages, indexes are no longer used, so make sure an index-using stage comes first (sometimes the query optimizer does this for you). This can be verified in the query plan.
The aggregation pipelines for the distinct and count queries:
db.time_series.aggregate( [
{ $match: { eid: { $exists: true } } },
{ $group : { _id : "$eid" } },
{ $project: { eid: "$_id", _id: 0 } }
] )
db.time_series.aggregate( [
{ $match: { eid: { $exists: true } } },
{ $group : { _id : null, count: { $sum: 1 } } } ,
{ $project: { _id: 0 } }
] )
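As a side note, assuming MongoDB 3.4 or newer, the $count stage is available as a shorthand for the $group / $project combination in the count pipeline:

```javascript
db.time_series.aggregate( [
    { $match: { eid: { $exists: true } } },
    { $count: "count" }    // equivalent to the $group + $project above
] )
```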
In this case too, I tried with and without the index on the eid field. explain can be used on an aggregation query like this:
db.time_series.explain("executionStats").aggregate( [ ... ] )
The explain stats with and without the index:

With the index:

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 4813,
    "totalKeysExamined" : 1144845,
    "totalDocsExamined" : 1144845,

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 8322,
    "totalKeysExamined" : 1144845,
    "totalDocsExamined" : 1144845,

Without the index:

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 1043,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 1145477,

"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1144845,
    "executionTimeMillis" : 3884,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 1145477,
Note on Memory Constraints in Aggregation:
- The documents in the result set are subject to the 16 MB BSON
  document size limit. The limit doesn't apply to the document size
  as it flows through the pipeline (that is, in the intermediate
  stages). The resulting document size can be controlled using the
  $limit and $project stages.
- Each stage has a 100 MB RAM limit. The best way to stay within it
  is to use indexes with the larger stages (this is because the
  indexes are smaller than the documents they reference). As a last
  resort, use the { allowDiskUse: true } option with the aggregation
  query; expect a degradation in performance when this option is
  used, since disk access is much slower than memory and the stage's
  work is spilled to disk. This option is more often used in batch
  processing.
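For example, the count pipeline from above could be run with the option like this:

```javascript
db.time_series.aggregate(
    [
        { $match: { eid: { $exists: true } } },
        { $group: { _id: null, count: { $sum: 1 } } }
    ],
    { allowDiskUse: true }    // lets a stage spill to disk if it exceeds the 100 MB limit
)
```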
Conclusion:
Achieving the right query performance is mostly a science. There are quite a few tools for this, like explain / the query planner and indexes. A bit of trial and error with various options on sample documents is required too. There is also some math involved: the document size and the number of documents tell you how many bytes of memory the documents and the indexes use.