
I have a MongoEngine model representing one entry of a distance matrix:

class Distance(db.Document):
    """
        An instance of the distance matrix.
    """

    orig = db.ObjectIdField(required=True, unique_with='dest')
    dest = db.ObjectIdField(required=True, unique_with='orig')
    paths = db.ListField(db.EmbeddedDocumentField(PathEmbedded))

    meta = {
        'indexes': [
            {'fields': ['orig'], 'cls': False},
            {'fields': ['dest'], 'cls': False},
            {'fields': ['orig', 'dest'], 'cls': False},
            {'fields': ['dest', 'orig'], 'cls': False}
        ],
    }
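Whether these declared indexes actually exist on the collection can be checked from the driver side. A minimal sketch, assuming the dict shape returned by pymongo's `Collection.index_information()` (the helper name `has_index` is mine):

```python
def has_index(index_info, key_pattern):
    """Return True if an index_information() dict contains the given key pattern.

    index_info: the dict returned by pymongo's Collection.index_information().
    key_pattern: a list of (field, direction) tuples, e.g. [('orig', 1), ('dest', 1)].
    """
    return any(spec.get('key') == key_pattern for spec in index_info.values())

# Against a live collection (hypothetical handle `pymongo_db`):
#   info = pymongo_db.distance.index_information()
#   has_index(info, [('orig', 1), ('dest', 1)])
```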

Because MongoEngine doesn't support aggregation, I've created a pymongo aggregation to get all distances between a list of location ids, so I have:

    pipeline = [
        {'$match': {
            'orig': {
                '$in': list_of_origins
            },
            'dest': {
                '$in': list_of_destinations
            }
        }},
        {'$project': {
            'paths': 1,
            '_id': 0,
            'ids': {'orig': '$orig', 'dest': '$dest'}
        }},
    ]

    # Open a pymongo connection to the db
    # (we're skipping the mongoengine ORM because it lacks support for aggregation)
    pymongo_connection = MongoClient(host=config.db_host, port=config.db_port)
    pymongo_db = pymongo_connection[config.db_name]

    cursor = pymongo_db.distance.aggregate(pipeline, cursor={}, explain=False)

    response.data = json_util.dumps(cursor)
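The pipeline above is plain data, so it can be built and inspected without a connection. A minimal sketch (the function name `build_distance_pipeline` is my own):

```python
def build_distance_pipeline(origins, destinations):
    """Build the $match/$project pipeline for the distance-matrix query."""
    return [
        {'$match': {
            'orig': {'$in': origins},
            'dest': {'$in': destinations},
        }},
        {'$project': {
            'paths': 1,
            '_id': 0,
            'ids': {'orig': '$orig', 'dest': '$dest'},
        }},
    ]
```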

Currently my list_of_origins and list_of_destinations have 100 elements each, thus resulting in 10,000 distances to be fetched.
The average runtime of this query in pymongo is ~1.15 seconds.
This seemed quite slow to me, so for comparison I wrote the same aggregation in JavaScript:

db.distance.aggregate([
    {$match: {
        'dest': {$in: [
            ObjectId('5436828e4ee264cf95bbb2a0'),
            ObjectId('543682904ee264cf95bbbd1d'),
            ...
            ObjectId('5436828e4ee264cf95bbb23a')
        ]}, 
        'orig': {$in: [
            ObjectId('5436828e4ee264cf95bbb0e1'),
            ObjectId('543682904ee264cf95bbbe5b'),
            ...
            ObjectId('543682904ee264cf95bbbc86')
        ]}
    }},
    {$project: {
        'paths': 1,
        '_id': 0,
        'ids': {'dest': '$dest', 'orig': '$orig'}}
    }
])

and ran it from the shell to get timings: `time mongo < distance_matrix_aggregation.js`.
The average runtime in this form is 65 ms.

So using pymongo is increasing the query time almost 20-fold. Considering that this application will eventually need to return distance matrices with millions of elements, this performance penalty will become a serious problem.

In order to reduce the query time I ran explain to check whether the indices were being used, but I got an error: `"planError" : "InternalError No plan available to provide stats"`, which seems to be related to a MongoDB bug. But since I created the indices explicitly in the MongoEngine model, I don't think indices are missing.

I'm wondering where the performance penalty actually happens:

  1. Is it some pre-processing done by pymongo before performing the actual query to MongoDB? If so, running a raw query should avoid that, so I searched for a way to run a raw MongoDB query from pymongo, but didn't find out how to do that.
  2. Is it because json_util.dumps(cursor) iterates over each cursor element individually, thus producing a large number of DB accesses? I've tried setting cursor.batch_size(20000) and saw no performance improvement...
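To tell the two apart, the fetch and the serialization can be timed separately. A minimal sketch (the helper `timed` is mine; the commented live-DB lines assume the handles defined above):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Against a live DB (handles `pymongo_db`, `pipeline`, `json_util` as above):
#   docs, fetch_s = timed(lambda: list(pymongo_db.distance.aggregate(pipeline)))
#   payload, dump_s = timed(json_util.dumps, docs)
# If dump_s dominates, the penalty is in serialization, not in the driver.
```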

Any tips for improving performance on this?

Update

As suggested in comment I've run:

> db.distance.find({ "orig" : { "$in" : origins }, "dest" : { "$in" : destinations }}, { "paths" : 1, "_id" : 0 }).explain()

Which returned:

{
        "cursor" : "BtreeCursor orig_1_dest_1",
        "isMultiKey" : false,
        "n" : 10100,
        "nscannedObjects" : 10100,
        "nscanned" : 10200,
        "nscannedObjectsAllPlans" : 10405,
        "nscannedAllPlans" : 10506,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 81,
        "nChunkSkips" : 0,
        "millis" : 26,
        "indexBounds" : {
                "orig" : [
                        [
                                ObjectId("5436828e4ee264cf95bbb0e1"),
                                ObjectId("5436828e4ee264cf95bbb0e1")
                        ],
                        <SNIP - 100 more OIds >
                ],
                "dest" : [
                        [
                                ObjectId("5436828e4ee264cf95bbb0e1"),
                                ObjectId("5436828e4ee264cf95bbb0e1")
                        ],
                        <SNIP - 100 more OIds >
                ]
        },
        "server" : <SNIP>,
        "filterSet" : false
}

So, here the query is taking 26 ms to run, which is great. Unfortunately, the equivalent MongoEngine query Distance.objects.filter(orig__in=origins, dest__in=destinations) takes 3.9 s, and I don't really know what to make of its explain() output:

{
    "nYields": 81,
    "nscannedAllPlans": 10608,
    "filterSet": false,
    "allPlans": [
        {
            "nChunkSkips": 0,
            "n": 10100,
            "cursor": "BtreeCursor orig_1_dest_1",
            "scanAndOrder": false,
            "indexBounds": {
                "dest": [ <SNIP> ],
                "orig": [ <SNIP> ],
                ]
            },
            "nscannedObjects": 10100,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 10200
        },
        {
            "nChunkSkips": 0,
            "n": 101,
            "cursor": "BtreeCursor dest_1_orig_1",
            "scanAndOrder": false,
            "indexBounds": {
                "dest": [ <SNIP> ],
                "orig": [ <SNIP> ],
            },
            "nscannedObjects": 101,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        },
        {
            "nChunkSkips": 0,
            "n": 97,
            "cursor": "BtreeCursor orig_1",
            "scanAndOrder": false,
            "indexBounds": {
                "orig": [ <SNIP> ],
            },
            "nscannedObjects": 102,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        },
        {
            "nChunkSkips": 0,
            "n": 95,
            "cursor": "BtreeCursor dest_1",
            "scanAndOrder": false,
            "indexBounds": {
                "dest": [ <SNIP> ],
            },
            "nscannedObjects": 102,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        },
        {
            "nChunkSkips": 0,
            "n": 94,
            "cursor": "BtreeCursor _cls_1",
            "scanAndOrder": false,
            "indexBounds": {
                "_cls": [
                    [
                        "Distance",
                        "Distance"
                    ]
                ]
            },
            "nscannedObjects": 102,
            "isMultiKey": false,
            "indexOnly": false,
            "nscanned": 102
        }
    ],
    "millis": 11,
    "nChunkSkips": 0,
    "server": "darkStar9:27017",
    "n": 10100,
    "cursor": "BtreeCursor orig_1_dest_1",
    "scanAndOrder": false,
    "indexBounds": {
        "dest": [ <SNIP> ],
        "orig": [ <SNIP> ],
    },
    "nscannedObjectsAllPlans": 10507,
    "isMultiKey": false,
    "stats": {
        "works": 10201,
        "isEOF": 1,
        "needFetch": 0,
        "needTime": 100,
        "yields": 81,
        "invalidates": 0,
        "unyields": 81,
        "type": "KEEP_MUTATIONS",
        "children": [
            {
                "works": 10201,
                "isEOF": 1,
                "forcedFetches": 0,
                "needFetch": 0,
                "matchTested": 10100,
                "needTime": 100,
                "yields": 81,
                "alreadyHasObj": 0,
                "invalidates": 0,
                "unyields": 81,
                "type": "FETCH",
                "children": [
                    {
                        "works": 10201,
                        "boundsVerbose": "field #0['orig']: [ <SNIP> ], field #1['dest']: [ <SNIP> ]",
                        "dupsTested": 0,
                        "yieldMovedCursor": 0,
                        "isEOF": 1,
                        "needFetch": 0,
                        "matchTested": 0,
                        "needTime": 100,
                        "keysExamined": 10200,
                        "seenInvalidated": 0,
                        "dupsDropped": 0,
                        "yields": 81,
                        "isMultiKey": 0,
                        "invalidates": 0,
                        "unyields": 81,
                        "type": "IXSCAN",
                        "children": [ ],
                        "advanced": 10100,
                        "keyPattern": "{ orig: 1, dest: 1 }"
                    }
                ],
                "advanced": 10100
            }
        ],
        "advanced": 10100
    },
    "indexOnly": false,
    "nscanned": 10200,
    "nscannedObjects": 10100
}
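If the gap comes from MongoEngine constructing a full Document instance per result, fetching raw dicts with the queryset's `.as_pymongo()` may close it. The `to_matrix` helper below is my own illustration; the live-query lines are a sketch, not verified here:

```python
def to_matrix(raw_docs):
    """Index raw distance documents by (orig, dest) for O(1) lookup."""
    return {(d['orig'], d['dest']): d.get('paths', []) for d in raw_docs}

# With a live MongoEngine connection (sketch):
#   raw = (Distance.objects
#          .filter(orig__in=origins, dest__in=destinations)
#          .as_pymongo())
#   matrix = to_matrix(raw)
```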
  • The find query `db.distance.find({ "orig" : { "$in" : list_of_origins }, "dest" : { "$in" : list_of_destinations } }, { "paths" : 1, "_id" : 0 })` does exactly the same thing up to a small difference in output structure. Why are you using an aggregation instead? Try running an explain on the find and posting it so we can see what the index usage is like. – wdberkeley Dec 26 '14 at 23:36
  • @wdberkeley: That is a good point - at some point there used to be more operations in the aggregation, but the doc structure has since changed. The aggregation is not actually required anymore. – Emanuel Ey Jan 05 '15 at 15:56
  • The explains indicate an efficiently indexed query (`n` very close to `nscanned`). Did the 3.9s come from timing the explain run of the query in mongoengine? mongoengine produced a verbose explain which ran all candidate query plans - effectively running (in parallel) the query in every different way that used indexes. It's not a good indication of how long the query will take. – wdberkeley Jan 06 '15 at 03:51
  • The 3.9s is the time taken by the actual query, not the explain. – Emanuel Ey Jan 06 '15 at 11:39
