
I have a collection with 500k documents with the following structure:

{
    "_id" : ObjectId("5f2d30b0c7cc16c0da84a57d"),
    "RecipientId" : "6a28d20f-4741-4c14-a055-2eb2593dcf13",
    
    ...
    
    "Actions" : [ 
        {
            "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
            "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
            "ActionDescription" : "OPEN"
        }, 
        ...
    ]
}

I need to count the top level documents whose subdocuments inside the "Actions" array meet certain criteria, and for this I've created the following Multikey index (taking only the "ActionDatetime" field as an example):

db.getCollection("recipients").createIndex( { "Actions.ActionDatetime": 1 } )

The problem is that when I write the query using an $elemMatch, the operation is much slower than when I don't use the Multikey index at all:

db.getCollection("recipients").count({
  "Actions": {
    $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } }
  }
})

The stats for this query:

{
    "executionSuccess" : true,
    "nReturned" : 0,
    "executionTimeMillis" : 13093,
    "totalKeysExamined" : 8706602,
    "totalDocsExamined" : 500000,
    "executionStages" : {
        "stage" : "COUNT",
        "nReturned" : 0,
        "executionTimeMillisEstimate" : 1050,
        "works" : 8706603,
        "advanced" : 0,
        "needTime" : 8706602,
        "needYield" : 0,
        "saveState" : 68020,
        "restoreState" : 68020,
        "isEOF" : 1,
        "nCounted" : 500000,
        "nSkipped" : 0,
        "inputStage" : {
            "stage" : "FETCH",
            "filter" : {
                "Actions" : {
                    "$elemMatch" : {
                        "ActionDatetime" : {
                            "$gt" : ISODate("1950-08-04T00:00:00.000Z")
                        }
                    }
                }
            },
            "nReturned" : 500000,
            "executionTimeMillisEstimate" : 1040,
            "works" : 8706603,
            "advanced" : 500000,
            "needTime" : 8206602,
            "needYield" : 0,
            "saveState" : 68020,
            "restoreState" : 68020,
            "isEOF" : 1,
            "docsExamined" : 500000,
            "alreadyHasObj" : 0,
            "inputStage" : {
                "stage" : "IXSCAN",
                "nReturned" : 500000,
                "executionTimeMillisEstimate" : 266,
                "works" : 8706603,
                "advanced" : 500000,
                "needTime" : 8206602,
                "needYield" : 0,
                "saveState" : 68020,
                "restoreState" : 68020,
                "isEOF" : 1,
                "keyPattern" : {
                    "Actions.ActionDatetime" : 1.0
                },
                "indexName" : "Actions.ActionDatetime_1",
                "isMultiKey" : true,
                "multiKeyPaths" : {
                    "Actions.ActionDatetime" : [ 
                        "Actions"
                    ]
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2,
                "direction" : "forward",
                "indexBounds" : {
                    "Actions.ActionDatetime" : [ 
                        "(new Date(-612576000000), new Date(9223372036854775807)]"
                    ]
                },
                "keysExamined" : 8706602,
                "seeks" : 1,
                "dupsTested" : 8706602,
                "dupsDropped" : 8206602
            }
        }
    }
}

This query took 14 seconds to execute, whereas if I remove the index, the COLLSCAN takes 1 second.

I understand that I'd get better performance by not using $elemMatch and filtering on "Actions.ActionDatetime" directly, but in reality I'll need to filter by more than one field inside the array, so $elemMatch becomes mandatory.
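For example, the real query will look more like this (using ActionDescription as a hypothetical second criterion), where both conditions must hold for the same array element:

```javascript
db.getCollection("recipients").count({
  "Actions": {
    $elemMatch: {
      // both conditions must match within a single Actions element
      ActionDatetime: { $gt: new Date("1950-08-04") },
      ActionDescription: "OPEN"
    }
  }
})
```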

I suspect that it's the FETCH phase that's killing the performance, but I've noticed that when I use "Actions.ActionDatetime" directly, MongoDB is able to use a COUNT_SCAN instead of the FETCH; however, the performance is still poorer than the COLLSCAN (4s).
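For reference, this is the direct form I mean, which the planner can answer with a COUNT_SCAN:

```javascript
// No $elemMatch: the index alone can answer the count, so no FETCH stage.
db.getCollection("recipients").count({
  "Actions.ActionDatetime": { $gt: new Date("1950-08-04") }
})
```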

I'd like to know if there's a better indexing strategy for indexing subdocuments with high cardinality inside an array, or if I'm missing something with my current approach. As the volume grows, indexing this information will be a necessity and I don't want to rely on a COLLSCAN.

djed
  • `$elemMatch` is redundant if you are only testing one field inside the array. How does the indexed performance compare without the elemMatch? – Joe Aug 07 '20 at 16:57
  • @Joe Yeah, I mentioned that in the question, but in reality I will need to query more than one field, so I will need the $elemMatch anyway, I'm just using one field as an example. The performance without the $elemMatch is better (4 seconds), because the planner uses a COUNT_SCAN and doesn't need to FETCH any elements. However, the COLLSCAN performance is still better (1s). I can't tell if the problem lies in the usage of $elemMatch, or if it's a problem with the multikey index, or both. – djed Aug 07 '20 at 17:05

2 Answers


The problem here is twofold:

  • Every document matches your query
    Consider the analogy of an index being the catalog in a library. If you want to find a single book, looking it up in the catalog lets you go straight to the shelf holding it, which is much faster than starting at the first shelf and searching through the books (unless of course it actually is on that first shelf). However, if you want to get all of the books in the library, it will be much faster to just start taking them off the shelves than to check the catalog for each one and then go get it.
    While this analogy is far from perfect it does show that the collection scan can be expected to be much more efficient than index lookups when a large percentage of the documents will be considered.

  • Multikey indexes have more than one entry for each document
    When mongod builds an index on an array, it creates a separate entry in the index for each discrete element. When you match a value from an array element, the index can get you to a matching document quickly, but because a single document is expected to have multiple entries in the index, deduplication is required afterward.

These are further exacerbated by the $elemMatch. Since the index stores the values of each indexed field separately, it cannot determine from the index alone whether the values for different fields occur within the same array element, so it must load each document to check that.

Essentially, when using $elemMatch with the index and a query that matches every document, the mongod node will examine the index to identify matching values, deduplicate that list, and then load each document (likely in the order encountered in the index) to see if a single array element satisfies the $elemMatch.
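A toy sketch (plain JavaScript, not a MongoDB API) of that scan-then-deduplicate behaviour, and of why keysExamined (8,706,602) dwarfs docsExamined (500,000) in the stats above:

```javascript
// Toy model of a multikey index scan: each array element contributes
// one index key, so a range that matches every element yields far more
// keys than documents, and the scan must deduplicate by document id.
const docs = [
  { _id: 1, Actions: [{ d: 1990 }, { d: 1995 }, { d: 2000 }] },
  { _id: 2, Actions: [{ d: 1998 }, { d: 2001 }] },
];

// "Index entries": one (value, docId) pair per array element.
const indexKeys = docs.flatMap(doc =>
  doc.Actions.map(a => ({ key: a.d, id: doc._id })));

// Range scan: every key qualifies, like the $gt: 1950 filter in the question.
const matching = indexKeys.filter(e => e.key > 1950);

// Dedup stage: collapse the multiple keys per document down to one result.
const deduped = new Set(matching.map(e => e.id));

console.log(matching.length); // keys examined: 5
console.log(deduped.size);    // documents returned: 2
```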

When compared with the non-indexed collection scan execution, where the mongod must load each document in the order encountered on disk and check if a single array element satisfies the $elemMatch, it should be apparent that the indexed query will perform worse if a large percentage of the documents match the query.
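If you want to compare the two plans yourself, you can force the collection scan with a $natural hint (assuming your shell and server versions accept a hint option to count()):

```javascript
// Force a COLLSCAN and compare executionStats against the indexed plan.
db.getCollection("recipients").explain("executionStats").count(
  { "Actions": { $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } } } },
  { hint: { $natural: 1 } }
)
```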

Joe
  • Thank you for the response @Joe. Do you have any advice as to how I can improve the performance of this type of query? – djed Aug 10 '20 at 08:47

TLDR: this is the expected behaviour of a multikey index combined with an $elemMatch.

Why is this happening?

So it is the FETCH stage that's ruining your query performance; unfortunately, this is the expected behaviour.

From the covered queries section of the multikey index documentation:

Multikey indexes cannot cover queries over array field(s).

Meaning not all of the information about a sub-document is in the multikey index (a count on a single field is the exception, where it can do better). But why is $elemMatch still forcing a FETCH stage in this case, when the query only uses a single field?

Imagine this scenario:

//doc1
{
   "Actions" : [ 
        {
            "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
            "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
            "ActionDescription" : "OPEN"
        }, 
        ...
    ]
}
//doc2
{
   "Actions" : {
            "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
            "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
            "ActionDescription" : "OPEN"
   }
}

Because Mongo "flattens" the array it indexes, once the index is built Mongo cannot differentiate between these 2 documents. But $elemMatch requires an actual array to match, so Mongo has to fetch these documents to determine which one qualifies. This is the exact problem you're facing.
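You can see the difference in the shell (using a hypothetical test collection holding doc1 and doc2 from above):

```javascript
// Dot notation matches BOTH documents: it traverses array elements
// and plain embedded documents alike.
db.test.find({ "Actions.ActionDescription": "OPEN" })

// $elemMatch matches only doc1: it requires Actions to be an array,
// and the flattened index keys alone cannot tell the two shapes apart.
db.test.find({ "Actions": { $elemMatch: { ActionDescription: "OPEN" } } })
```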

What can you do?

Well, not much, sadly. I'm not sure how dynamic your queries are, but the only way to solve this issue is to preprocess the documents to contain the "answers" to your queries.

I still find it hard to believe that a COLLSCAN is doing better than the indexed query. I'm assuming you're matching a large portion of your collection, combined with the fact that the Actions arrays are very large.

What I would suggest, since performance will keep being an issue (especially if your queries continue to match a large portion of the collection), is to restructure your data: just save each Actions entry as its own document.

{
   "Actions" : {
            "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
            "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
            "ActionDescription" : "OPEN"
   }
}

Then your query will be able to use an index; you'll just have to use a different query than count. A distinct on RecipientId sounds like a valid option.
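For example (assuming, as a sketch, that the restructured documents live in their own `actions` collection and each carries its parent's RecipientId):

```javascript
// Index supports the range filter; RecipientId is what we count.
db.actions.createIndex({ ActionDatetime: 1, RecipientId: 1 })

// Count recipients having at least one action matching the criteria.
db.actions.distinct(
  "RecipientId",
  { ActionDatetime: { $gt: new Date("1950-08-04") } }
).length
```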

Tom Slabbaert
  • Thanks for the explanation. Just to make sure I understand, you're suggesting keeping the Actions documents in a separate collection? The problem with that approach is that in addition to filtering by fields in the Actions objects, I also need to filter by other fields in the parent document. So if I kept them in separate collections I'd have to join the different collections, which is something I'm looking to avoid too. (Another note, you missed the [ ] in the doc2 Actions field) – djed Aug 10 '20 at 08:50
  • I didn't miss the `[]` in doc2. I was intentionally using this structure discrepancy to explain why `FETCH` is forced. – Tom Slabbaert Aug 10 '20 at 09:08