
There is a `deal` collection containing 50 million documents with this structure:

{
  "user": <string>,
  "price": <number>
}

The collection has this index:

{
    "v" : 2,
    "unique" : false,
    "key" : {
        "user" : 1
    },
    "name" : "deal_user",
    "background" : true,
    "sparse" : true
}
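
For reference, such an index would be created roughly like this (a sketch; the exact creation command is an assumption, since the original setup is not shown):

// Sketch: how an index with the options above would typically be created.
// The exact command is assumed, not taken from the actual setup.
db.getCollection('deal').createIndex(
    { user: 1 },
    { name: "deal_user", background: true, sparse: true }
)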

I am trying to execute this aggregation:

db.getCollection('deal').aggregate([       
  {
      $match: {user: {$in:  [ 
          "u-4daf809bd", 
          "u-b967c0a84", 
          "u-ae03417c1", 
          "u-805ce887d", 
          "u-0f5160823", 
          /* more items */
      ]}}
  },  
  {
      $group: {
          _id: "$user",
          minPrice: {$min: "$price"}
      }
  }
])

When the array size inside `$match`/`$in` is less than ~50, the query response time is mostly under 2 seconds. For large arrays (size > 200), the response time is around 120 seconds.

I also tried chunking the array into parts of 10-50 elements and querying in parallel, as sketched below. There is a strange (reproducible) effect: most of the queries (~80%) respond fast (2-5 seconds), but some hang for ~100 seconds, so the parallelisation did not bear fruit.
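
The chunked variant looked roughly like this (a sketch: `users` and `chunkSize` are placeholder values, and in reality the chunks were sent in parallel from the application side, not sequentially as shown here):

// Sketch of the chunking approach (placeholder values; the real code
// issued the per-chunk queries in parallel from the application).
const users = ["u-4daf809bd", "u-b967c0a84" /* ... */];
const chunkSize = 50;

for (let i = 0; i < users.length; i += chunkSize) {
    const chunk = users.slice(i, i + chunkSize);
    const results = db.getCollection('deal').aggregate([
        { $match: { user: { $in: chunk } } },
        { $group: { _id: "$user", minPrice: { $min: "$price" } } }
    ]).toArray();
    printjson(results);
}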

I guess there are some kind of "slow" and "normal" values, but I cannot explain what is going on, because they all belong to the same index and should be fetched in approximately the same time. It correlates slightly with the number of duplicates in the `user` field (i.e. the size of each group), but a large number of duplicates does not always entail "slow".

Please help me understand why this MongoDB query behaves this way.
What is the proper way to run this kind of query?

  • Interesting! If you do a single `$match/$in` on one of the "slow" chunks, does it respond quickly? – rickhg12hs Feb 11 '22 at 21:30
  • With no particular reason to think so, I wonder if you are hitting `internalDocumentSourceGroupMaxMemoryBytes`. – rickhg12hs Feb 11 '22 at 21:52
  • @rickhg12hs there are cases when even a chunk containing a single `user` value in `$match`/`$in` causes a long response time. And if this value is included in any "normal" chunk, then the chunk becomes "slow". – diziaq Feb 12 '22 at 03:54
  • 1
    Run explain `coll.explain("executionStats").aggregate(...)` and check the totalDocsExamined – qtxo Feb 12 '22 at 03:56

0 Answers