
Consider the following aggregation pipeline, which returns the newest entry for each distinct "internal_id":

db.locations.aggregate([{$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}])

This call takes up to 10 seconds, which is not acceptable. The collection is not that huge:

db.locations.count()
1513671

So I guess there's something wrong with the indexes. I tried creating many indexes, but none of them made an improvement; I currently keep the two that should, imho, be enough: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.
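
For reference, those two indexes would have been created along these lines (default options assumed):

db.locations.createIndex({timestamp: -1, internal_id: 1})
db.locations.createIndex({internal_id: 1, timestamp: -1})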

MongoDB is NOT sharded; it's a 3-host replica set running version 3.6.14.

The MongoDB log shows the following:

2020-05-30T12:21:18.598+0200 I COMMAND  [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677ms

2 Answers


Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:

  1. {$sort: {timestamp: -1}}: sort all documents by the timestamp.
  2. {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}: go through these timestamp-sorted documents and group them by the id. Because everything is sorted by timestamp at this point (rather than by id), this ends up being a decent amount of work.

You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.
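
If you'd rather not dig through the logs, the shell can report the winning plan directly; a minimal sketch:

// Show the query plan for the pipeline without fetching all results
db.locations.explain().aggregate([
    {$sort: {timestamp: -1}},
    {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}
])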

You want to force mongo to come up with a better query plan than that, one that uses the {internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work -- it depends on how well it's able to calculate the query plan.
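
In the shell, the hint can be passed as an aggregate option (the aggregate command accepts hint as of MongoDB 3.6); whether the planner can actually satisfy the pipeline from that index is the open question:

db.locations.aggregate(
    [
        {$sort: {timestamp: -1}},
        {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}
    ],
    {hint: {internal_id: 1, timestamp: -1}}
)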

If providing that hint doesn't work, one alternative would be to break this query into 2 parts, each of which uses an appropriate index.

  1. Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
  2. Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id, you may need to dedupe); see the sketch after this list.
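
Concretely, the two steps might look like this in the shell, building the $or list from the first query's results:

// Step 1: newest timestamp per internal_id (can use the {internal_id: 1, timestamp: -1} index)
var latest = db.my_collection.aggregate([
    {$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}
]).toArray();

// Step 2: fetch the full documents for those (internal_id, timestamp) pairs
db.my_collection.find({
    $or: latest.map(function (e) {
        return {internal_id: e._id, timestamp: e.timestamp};
    })
})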

If you want to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.
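
A rough sketch of that combined form, using the pipeline-style $lookup available since 3.6 (note that $expr matches inside $lookup may not use indexes efficiently on this server version):

db.my_collection.aggregate([
    {$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}},
    {$lookup: {
        from: "my_collection",
        let: {id: "$_id", ts: "$timestamp"},
        pipeline: [
            {$match: {$expr: {$and: [
                {$eq: ["$internal_id", "$$id"]},
                {$eq: ["$timestamp", "$$ts"]}
            ]}}},
            {$limit: 1}
        ],
        as: "doc"
    }},
    {$unwind: "$doc"}
])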

  • Thank you very much! I'll test the various solutions and report back with speed measurements – Adam C. Jun 01 '20 at 20:08
  • Gave it a try today; sadly, it seems the Python asyncio driver cannot pass a hint, despite what's written in its documentation. I asked upstream about it. Will let you know. – Adam C. Jun 02 '20 at 18:18
  • That's a bummer! I think doing the two queries (without a hint) is a more likely solution, and you can adapt that pattern to be a bit quicker with a self-join if you need to – klhr Jun 03 '20 at 15:29

So finally I've been able to do all the testing. Here are all the versions I wrote, thanks to the answer above, along with the results:

Original aggregate query

mongo_query = [
  {"$match": group_filter},
  {"$sort": {"timestamp": -1}},
  {"$group": {"_id": "$internal_id", "doc": {"$first": "$$ROOT"}}},
]

res = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query)
res = await res.to_list(None)

9.61 seconds

Give MongoDB a hint to use proper index (filter internal_id first)

from bson.son import SON

cursor = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query, hint=SON([("internal_id", 1), ("timestamp", -1)]))
res = await cursor.to_list(None)

Not working: MongoDB replies with an exception saying the sort consumes too much memory.

Split aggregation, to first find the latest timestamp for each internal_id

cursor = mongo.db[self.factory.config.mongo_collection].aggregate([{"$group": {"_id": "$internal_id", "timestamp": {"$max": "$timestamp"}}}])
res = await cursor.to_list(None)

or_query = []
for entry in res:
    or_query.append({"internal_id": entry["_id"], "timestamp": entry["timestamp"]})
cursor = mongo.db[self.factory.config.mongo_collection].find({"$or": or_query})
fixed_res = await cursor.to_list(None)

1.88 seconds, a lot better but still not that fast

Parallel coroutines (and the winner is....)

In the meanwhile, as I already have the list of internal_id values, and I'm using asynchronous Python, I went for parallel coroutines, each getting the latest entry for a single internal_id:

import asyncio
from typing import Awaitable, Dict, List

fixed_res: List[Dict] = []

async def get_one_result(db_filter: Dict) -> None:
    """ Coroutine getting one result for each known internal ID """

    cursor = mongo.db[self.factory.config.mongo_collection].find(db_filter).sort("timestamp", -1).limit(1)
    res = await cursor.to_list(1)
    if res:
        fixed_res.append(res[0])

coros: List[Awaitable] = []
for internal_id in self.list_of_internal_ids:
    coro = get_one_result({"internal_id": internal_id})
    coros.append(coro)
await asyncio.gather(*coros)

0.5 seconds, way better than the others.

If you don't have a list of internal_id

There's an alternative I did not implement, but I confirmed the call is very fast: use the low-level distinct command against the {internal_id: 1} index to retrieve the list of individual IDs, then use parallel calls.
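
Something along these lines in the shell; the distinct call can be answered from the {internal_id: 1} index alone:

// Fast, index-backed retrieval of all distinct internal IDs
var ids = db.locations.distinct("internal_id")
// ...then run one parallel find()/sort()/limit(1) per ID, as in the coroutine version above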
