I have a collection with 10M records, with the following schema:

{
    "_id": ObjectId,
    "uid": string
}

uid is a random string, generated with unique values for testing purposes. There are two indexes: {"_id": 1} and {"uid": 1}.
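For reproducibility, the setup can be recreated along these lines from the mongo shell (a minimal sketch; the uid generation scheme below is my own illustration, any scheme producing unique strings will do):

// Sketch: populate 10M documents with unique pseudo-random uid strings.
// Uniqueness is guaranteed by the i-based prefix; the random suffix is cosmetic.
let batch = [];
for (let i = 0; i < 10000000; i++) {
    batch.push({ uid: i.toString(36) + "-" + Math.random().toString(36).slice(2) });
    if (batch.length === 10000) {
        db.test.insertMany(batch);
        batch = [];
    }
}
db.test.createIndex({ uid: 1 }); // the {"_id": 1} index exists by default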
The following two queries produce very inefficient query plans, which scan/fetch all 10M documents and take several seconds to run:
db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10).explain()
(...)
"queryPlanner" : {
"namespace" : "test.test",
"indexFilterSet" : false,
"parsedQuery" : {
},
"queryHash" : "EEBF6C62",
"planCacheKey" : "BC5E0BD8",
"maxIndexedOrSolutionsReached" : false,
"maxIndexedAndSolutionsReached" : false,
"maxScansToExplodeReached" : false,
"winningPlan" : {
"stage" : "SORT",
"sortPattern" : {
"uid" : 1,
"_id" : 1
},
"memLimit" : 104857600,
"limitAmount" : 10,
"type" : "simple",
"inputStage" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}
},
"rejectedPlans" : [ ]
},
(...)
db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10).explain()
(...)
"queryPlanner" : {
    "namespace" : "test.test",
    "indexFilterSet" : false,
    "parsedQuery" : {
        "uid" : {
            "$gt" : "a"
        }
    },
    "queryHash" : "F4847212",
    "planCacheKey" : "8F79D766",
    "maxIndexedOrSolutionsReached" : false,
    "maxIndexedAndSolutionsReached" : false,
    "maxScansToExplodeReached" : false,
    "winningPlan" : {
        "stage" : "SORT",
        "sortPattern" : {
            "uid" : 1,
            "_id" : 1
        },
        "memLimit" : 104857600,
        "limitAmount" : 10,
        "type" : "simple",
        "inputStage" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "uid" : 1
                },
                "indexName" : "uid_1",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "uid" : [ ]
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2,
                "direction" : "forward",
                "indexBounds" : {
                    "uid" : [
                        "(\"a\", {})"
                    ]
                }
            }
        }
    },
    "rejectedPlans" : [ ]
},
(...)
The analogous data model and queries in PostgreSQL produce very efficient query plans that run in under 1 ms:
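The PostgreSQL side looks roughly like this (the exact column types are an assumption on my part, inferred from the plans below):

create table test (
    id  bigserial primary key,
    uid varchar not null
);
create index uid on test (uid);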
explain (analyze, buffers)
select * from test
order by uid, id
limit 10;
Limit  (cost=0.65..1.87 rows=10 width=69) (actual time=0.477..0.480 rows=10 loops=1)
  Buffers: shared hit=15
  ->  Incremental Sort  (cost=0.65..1221670.34 rows=10000000 width=69) (actual time=0.477..0.478 rows=10 loops=1)
        Sort Key: uid, id
        Presorted Key: uid
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=15
        ->  Index Scan using uid on test  (cost=0.56..771670.34 rows=10000000 width=69) (actual time=0.085..0.190 rows=11 loops=1)
              Buffers: shared hit=15
Planning Time: 0.400 ms
Execution Time: 0.836 ms
explain (analyze, buffers)
select * from test
where uid > 'a'
order by uid, id
limit 10;
Limit  (cost=0.65..1.90 rows=10 width=69) (actual time=0.289..0.292 rows=10 loops=1)
  Buffers: shared hit=15
  ->  Incremental Sort  (cost=0.65..1246670.34 rows=10000000 width=69) (actual time=0.287..0.288 rows=10 loops=1)
        Sort Key: uid, id
        Presorted Key: uid
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=15
        ->  Index Scan using uid on test  (cost=0.56..796670.34 rows=10000000 width=69) (actual time=0.062..0.229 rows=11 loops=1)
              Index Cond: ((uid)::text > 'a'::text)
              Buffers: shared hit=15
Planning Time: 4.809 ms
Execution Time: 0.347 ms
Intuitively, since there is an index on uid, it should be straightforward to fetch the first N rows already sorted by uid, and then sort by _id within each group of equal uid values (which in this example is a no-op, since uid is unique). This seems to be exactly what PostgreSQL is doing.
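(Presumably a compound index that matches the sort pattern exactly would avoid the blocking SORT stage, as in the sketch below, but that works around the problem rather than explaining it.)

// Assumption: with a compound index whose key order matches the sort pattern,
// the planner can return documents already in the requested order, so the
// winning plan should be an IXSCAN feeding FETCH/LIMIT, with no SORT stage.
db.test.createIndex({ uid: 1, _id: 1 });
db.test.find({}).sort({ uid: 1, _id: 1 }).limit(10).explain();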
Are these inefficient query plans just a limitation of the MongoDB query planner, or am I doing something wrong?