2

I have about 3 million documents in my database. I have this query to get the min and max latitudes and longitudes contains in my documents because in my app I want to zoom on the existing data contained within a square. It takes about 16 seconds to execute :

Square represents 4 coordinates. tMin and tMax is my time interval (date).

    cursor = db.collection.aggregate([
        {
            "$match":
            {
                "nodeLoc":{"$geoWithin":{"$geometry":square}}, "t": {"$gt": tMin, "$lt": tMax}
            }
        },
        {
            "$group":
            {
                "_id": {},
                "minLat": {"$min": {"$arrayElemAt": [ "$nodeLoc.coordinates", 1]}},
                "maxLat": {"$max": {"$arrayElemAt": [ "$nodeLoc.coordinates", 1]}},
                "minLon": {"$min": {"$arrayElemAt": [ "$nodeLoc.coordinates", 0]}},
                "maxLon": {"$max": {"$arrayElemAt": [ "$nodeLoc.coordinates", 0]}}
            }
            }
    ]
    )

Is there a way I can optimize the $group or the $match stage ? I already created a compound index on nodeLoc (2dsphere) and t but I didn't see any improvement.

Edit:

I remove the indexes but the time to execute the query stay the same.

SwissFr
  • 178
  • 1
  • 12

1 Answers1

1

The group phase will not use an index imho, and this is very expensive scan via all data values.

The only possible idea for me is to reduce match criteria to operate on smaller dataSet and aggregate final results in app.

if we have, let's say 5 days from tMin and tMax - faster response could be achieved by running query in 5 rounds. and finally make an array merge and scan for min, max on 20 array entries.

Makes this any sense for you?

Any comments welcome!

profesor79
  • 9,213
  • 3
  • 31
  • 52
  • I want to avoid this. I might have a thousand of days so I doubt it is effective and also a single day can have up to one hundred thousand documents. – SwissFr Jul 04 '16 at 20:49
  • I found a tradeoff, I make 4 seperate queries with find and sort. For example : cursor = GPSData.find({"nodeLoc":{"$geoWithin":{"$geometry":square}}, "t": {"$gt": tMin, "$lt": tMax}}).sort("nodeLoc.coordinates.1", pymongo.ASCENDING).limit(1) I am down to 2 seconds overall :D – SwissFr Jul 04 '16 at 21:08
  • Actually after a few execution of my code, it takes now 7 seconds (???) – SwissFr Jul 04 '16 at 21:48
  • I found out that $geowithin is suboptimal when you need to find coordinates inside a rectangle. It's faster to check manually if lat > minLat, lat < maxLat. etc... – SwissFr Jul 05 '16 at 08:18
  • thanks for that - a new lesson for Us :-). Could you be so kind and accept my answer as it directed you to solution? – profesor79 Jul 05 '16 at 08:25
  • Thanks for your help – SwissFr Jul 05 '16 at 08:28