
I have a collection with 10M records, with the following schema:

{
   "_id": ObjectId,
   "uid": string
}

uid is a random string for testing purposes, generated with unique values.

There are 2 indexes: {"_id": 1} and {"uid": 1}.

The following two queries produce very inefficient query plans, which scan/fetch all 10M documents and take several seconds to run:

db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10).explain()

(...)
    "queryPlanner" : {
        "namespace" : "test.test",
        "indexFilterSet" : false,
        "parsedQuery" : {

        },
        "queryHash" : "EEBF6C62",
        "planCacheKey" : "BC5E0BD8",
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "uid" : 1,
                "_id" : 1
            },
            "memLimit" : 104857600,
            "limitAmount" : 10,
            "type" : "simple",
            "inputStage" : {
                "stage" : "COLLSCAN",
                "direction" : "forward"
            }
        },
        "rejectedPlans" : [ ]
    },
(...)
db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10).explain()
(...)
    "queryPlanner" : {
        "namespace" : "test.test",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "uid" : {
                "$gt" : "a"
            }
        },
        "queryHash" : "F4847212",
        "planCacheKey" : "8F79D766",
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "uid" : 1,
                "_id" : 1
            },
            "memLimit" : 104857600,
            "limitAmount" : 10,
            "type" : "simple",
            "inputStage" : {
                "stage" : "FETCH",
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "uid" : 1
                    },
                    "indexName" : "uid_1",
                    "isMultiKey" : false,
                    "multiKeyPaths" : {
                        "uid" : [ ]
                    },
                    "isUnique" : false,
                    "isSparse" : false,
                    "isPartial" : false,
                    "indexVersion" : 2,
                    "direction" : "forward",
                    "indexBounds" : {
                        "uid" : [
                            "(\"a\", {})"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : [ ]
    },
(...)

The analogous data model and queries in PostgreSQL produce very efficient query plans that run in <1ms.

explain (analyze, buffers)
select * from test
order by uid, id
limit 10;

Limit  (cost=0.65..1.87 rows=10 width=69) (actual time=0.477..0.480 rows=10 loops=1)
  Buffers: shared hit=15
  ->  Incremental Sort  (cost=0.65..1221670.34 rows=10000000 width=69) (actual time=0.477..0.478 rows=10 loops=1)
        Sort Key: uid, id
        Presorted Key: uid
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=15
        ->  Index Scan using uid on test  (cost=0.56..771670.34 rows=10000000 width=69) (actual time=0.085..0.190 rows=11 loops=1)
              Buffers: shared hit=15
Planning Time: 0.400 ms
Execution Time: 0.836 ms
explain (analyze, buffers)
select * from test
where uid > 'a'
order by uid, id
limit 10;

Limit  (cost=0.65..1.90 rows=10 width=69) (actual time=0.289..0.292 rows=10 loops=1)
  Buffers: shared hit=15
  ->  Incremental Sort  (cost=0.65..1246670.34 rows=10000000 width=69) (actual time=0.287..0.288 rows=10 loops=1)
        Sort Key: uid, id
        Presorted Key: uid
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=15
        ->  Index Scan using uid on test  (cost=0.56..796670.34 rows=10000000 width=69) (actual time=0.062..0.229 rows=11 loops=1)
              Index Cond: ((uid)::text > 'a'::text)
              Buffers: shared hit=15
Planning Time: 4.809 ms
Execution Time: 0.347 ms

Intuitively, since there is an index on uid, it should be straightforward to fetch N rows already sorted by uid and then sort by _id within each group of equal uid values (which in this example is a no-op, since uid is unique). This appears to be exactly what PostgreSQL is doing.
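That fetch-then-sort-within-groups strategy (PostgreSQL's incremental sort) can be sketched in a few lines. This is an illustration of the idea only, not either database's actual implementation:

```javascript
// Incremental sort sketch: rows arrive already ordered by uid (as they
// would from an index scan on uid); within each run of equal uid values
// we sort by _id, and we stop once the limit is satisfied.
function incrementalSortLimit(rowsSortedByUid, limit) {
  const out = [];
  let group = [];
  const flush = () => {
    group.sort((a, b) => a._id - b._id); // sort only within one uid group
    out.push(...group);
    group = [];
  };
  for (const row of rowsSortedByUid) {
    if (group.length && group[0].uid !== row.uid) flush();
    group.push(row);
    if (out.length >= limit) break; // enough complete groups emitted
  }
  flush();
  return out.slice(0, limit);
}

const rows = [
  { uid: "a", _id: 9 },
  { uid: "a", _id: 2 },
  { uid: "b", _id: 7 },
  { uid: "c", _id: 1 },
];
console.log(incrementalSortLimit(rows, 3).map(r => r._id)); // [ 2, 9, 7 ]
```

Only one uid group is ever held in memory at a time, which is consistent with the PostgreSQL plans above peaking at 25kB even over 10M rows.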

Are the inefficient query plans just a limitation on the MongoDB query planner, or am I doing something wrong?

user41951
  • Personally, I guess this is why different kinds of databases are used for different purposes. In the case of MongoDB, we rarely do this kind of sorting involving millions of documents. MongoDB is a document store: you either get the document by its uid, or you use some kind of filter to get the specific document you want. It's efficient in that sense. If you want to sort all documents, then MongoDB will have to go through every index entry before it returns the result to you. – Someone Special Sep 20 '22 at 09:43
  • quoted from a site > MongoDB does in-memory sorting. All documents stored on disk are read into RAM, then a sorting algorithm is performed. In-memory sorting is inefficient, time-consuming & sorting a large number of documents in memory is quite expensive. Moreover, MongoDB aborts the operation if memory usage is more than 32MB. (you can set this higher) – Someone Special Sep 20 '22 at 09:46
  • To provide a bit more context, this type of query is used for keyset pagination, where I want to sort the results by `uid` (or an arbitrary field) but also sort by `_id` to keep a stable order in case of duplicates. So it should be a common query pattern. – user41951 Sep 20 '22 at 09:48
  • In MongoDB, if we were to do pagination with $skip + $sort, it would be very slow. The recommended way is to provide the last uid of the previous page and use a $gt: uid query to get the next dataset. – Someone Special Sep 20 '22 at 09:50
  • @SomeoneSpecial Yes, that's what I mean by keyset pagination. The second query in my post is exactly the $gt: uid query. – user41951 Sep 20 '22 at 09:52
  • Right, I just saw that, which means a redesign of your DB is probably required, or you should explore whether MongoDB is right for your usage. You may also attempt to use a single index with both _id and uid and see whether it improves your query timings. – Someone Special Sep 20 '22 at 10:01
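The keyset-pagination pattern discussed above can be illustrated outside any database. A minimal sketch, where fetchPage mimics the shape of a $gt-based next-page query over the compound key (uid, _id):

```javascript
// cmp orders documents the way sort({uid: 1, _id: 1}) would.
function cmp(a, b) {
  if (a.uid !== b.uid) return a.uid < b.uid ? -1 : 1;
  return a._id - b._id;
}

// Mimics: find({$or: [{uid: {$gt: last.uid}},
//                     {uid: last.uid, _id: {$gt: last._id}}]})
//           .sort({uid: 1, _id: 1}).limit(n)
function fetchPage(docs, pageSize, after) {
  let ordered = [...docs].sort(cmp);
  if (after) ordered = ordered.filter(d => cmp(d, after) > 0);
  return ordered.slice(0, pageSize);
}

const docs = [
  { _id: 1, uid: "b" },
  { _id: 2, uid: "a" },
  { _id: 3, uid: "a" }, // duplicate uid: _id breaks the tie
  { _id: 4, uid: "c" },
];

const page1 = fetchPage(docs, 2);
const page2 = fetchPage(docs, 2, page1[page1.length - 1]);
console.log(page1.map(d => d._id)); // [ 2, 3 ]
console.log(page2.map(d => d._id)); // [ 1, 4 ]
```

Note that with duplicate uid values, paginating on uid alone with a plain $gt would skip or repeat documents; the compound key makes the cursor position unambiguous.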

2 Answers


I am not sure whether to call it a limitation, but in the first case, for the query:

db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10).explain()

MongoDB doesn't use any index because the filter object is empty and no existing index matches the requested sort order. That's why you see the COLLSCAN stage. If you try running these queries:

db.test.find({}).sort({"uid": 1}).limit(10).explain()

OR

db.test.find({}).sort({"_id": 1}).limit(10).explain()

You'll see the IXSCAN stage come into play, because indexes exist that can satisfy those sort orders directly. To bring the index into play, you can simply add a filter like:

 db.test.find({uid: {$exists: true}}).sort({"uid": 1, "_id": 1}).limit(10).explain()

For the second query,

db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10).explain()

MongoDB is behaving the same as PostgreSQL: it uses the uid index to filter the documents and then sorts them by _id, as the index is already sorted by uid.

Best of all, if you need to sort by both uid and _id, you can simply create a compound index consisting of both fields, like this:

db.test.createIndex({uid: 1, _id: 1})

After creating this index, both of your queries will use it, regardless of whether a filter is present.
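A quick way to verify, assuming a mongosh session against the same test collection (uid_1__id_1 is the name MongoDB assigns by default):

```javascript
db.test.createIndex({"uid": 1, "_id": 1})

// Both explains should now show an IXSCAN over uid_1__id_1 as the
// winning plan, with no blocking SORT stage above it:
db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10)
  .explain().queryPlanner.winningPlan
db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10)
  .explain().queryPlanner.winningPlan
```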

Charchit Kapoor

The fact that MongoDB is not doing an incremental sort like PostgreSQL appears to be a limitation of the system. It doesn't look like you are doing anything directly wrong in that regard. There are still two observations/recommendations that can be made here.

Compound Index

As @Charchit Kapoor mentions, a compound index on {uid: 1, _id: 1} will allow MongoDB to efficiently perform this query as written (and avoid a blocking sort altogether).

Use a Single Field

You mention the following:

uid is a random string for testing purposes, generated with unique values.

This leads to two questions:

  1. Why are you using two separate unique fields (as opposed to, for example, just using the string value directly as your _id value)?
  2. Assuming that you want both, why are you performing a compound sort?

Looking at the latter question, if the uid field is unique then appending _id to the sort does not change the results in any way. There is only a single document with a given uid value so there is no additional ordering that the trailing _id field provides.

Assuming that you want to keep both fields, then simply dropping _id from the sort should keep the result set the same while allowing MongoDB to efficiently use the { uid: 1 } index to perform the operation.
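That equivalence is easy to sanity-check outside the database; a small sketch with synthetic documents whose uid values are unique by construction:

```javascript
// With unique uid values, sorting by uid alone and sorting by
// (uid, _id) produce identical orderings: the trailing _id key only
// matters when two documents share a uid.
const docs = [];
for (let i = 0; i < 100; i++) {
  docs.push({ _id: i, uid: "u" + (1000 - i) }); // unique by construction
}

const byUid = [...docs].sort((a, b) =>
  a.uid < b.uid ? -1 : a.uid > b.uid ? 1 : 0);
const byUidThenId = [...docs].sort((a, b) => {
  if (a.uid !== b.uid) return a.uid < b.uid ? -1 : 1;
  return a._id - b._id;
});

console.log(byUid.every((d, i) => d === byUidThenId[i])); // true
```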

user20042973