
I have a collection with coordinate data stored as GeoJSON Points, from which I need to query for the 10 latest entries within an area. There are currently 1,000,000 entries, but there will be about 10 times more.
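
For reference, each document looks roughly like this (the position and timestamp field names match the indexes below; the values and the exact timestamp type shown here are just illustrative):

{
    "_id" : ObjectId("..."),
    "position" : {
        "type" : "Point",
        "coordinates" : [ 24.94, 60.17 ]
    },
    "timestamp" : ISODate("2013-09-19T12:34:56Z")
}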

My problem is that when there are lots of entries within the desired area, the performance of my queries drops dramatically (case 3). The test data I currently have is random, but the real data won't be, so picking another index (as in case 4) based purely on the dimensions of the area won't be possible.

What should I do to get it to perform predictably regardless of the area?

1. The collection statistics:

> db.randomcoordinates.stats()
{
    "ns" : "test.randomcoordinates",
    "count" : 1000000,
    "size" : 224000000,
    "avgObjSize" : 224,
    "storageSize" : 315006976,
    "numExtents" : 15,
    "nindexes" : 3,
    "lastExtentSize" : 84426752,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 120416128,
    "indexSizes" : {
        "_id_" : 32458720,
        "position_2dsphere_timestamp_-1" : 55629504,
        "timestamp_-1" : 32327904
    },
    "ok" : 1
}

2. The indexes:

> db.randomcoordinates.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.randomcoordinates",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "position" : "2dsphere",
            "timestamp" : -1
        },
        "ns" : "test.randomcoordinates",
        "name" : "position_2dsphere_timestamp_-1"
    },
    {
        "v" : 1,
        "key" : {
            "timestamp" : -1
        },
        "ns" : "test.randomcoordinates",
        "name" : "timestamp_-1"
    }
]

3. Find using 2dsphere compound index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("position_2dsphere_timestamp_-1").explain()
{
    "cursor" : "S2Cursor",
    "isMultiKey" : true,
    "n" : 10,
    "nscannedObjects" : 116775,
    "nscanned" : 283424,
    "nscannedObjectsAllPlans" : 116775,
    "nscannedAllPlans" : 283424,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 3876,
    "indexBounds" : {

    },
    "nscanned" : 283424,
    "matchTested" : NumberLong(166649),
    "geoTested" : NumberLong(166649),
    "cellsInCover" : NumberLong(14),
    "server" : "chan:27017"
}

4. Find using timestamp index:

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("timestamp_-1").explain()
{
    "cursor" : "BtreeCursor timestamp_-1",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 63,
    "nscanned" : 63,
    "nscannedObjectsAllPlans" : 63,
    "nscannedAllPlans" : 63,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "timestamp" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    },
    "server" : "chan:27017"
}

Some have suggested using a {timestamp: -1, position: "2dsphere"} index, so I tried that out as well, but it doesn't seem to perform well enough.
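
(That index was presumably created with an ordinary ensureIndex call along these lines; the resulting name matches the hint used below.)

db.randomcoordinates.ensureIndex({timestamp: -1, position: "2dsphere"})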

5. Find using Timestamp + 2dsphere compound index

> db.randomcoordinates.find({position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}).sort({timestamp: -1}).limit(10).hint("timestamp_-1_position_2dsphere").explain()
{
    "cursor" : "S2Cursor",
    "isMultiKey" : true,
    "n" : 10,
    "nscannedObjects" : 116953,
    "nscanned" : 286513,
    "nscannedObjectsAllPlans" : 116953,
    "nscannedAllPlans" : 286513,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 4597,
    "indexBounds" : {

    },
    "nscanned" : 286513,
    "matchTested" : NumberLong(169560),
    "geoTested" : NumberLong(169560),
    "cellsInCover" : NumberLong(14),
    "server" : "chan:27017"
}
hleinone
  • Can you clarify what you mean when you say "so picking another index (like in the case 4) based purely on the dimensions of the area won't be possible."? It seems to me that, regardless of the area size, since you are only looking for the ten most recent points you will always do better with the timestamp index where scanAndOrder is false and nscanned is nearest to n. Given that, I would suggest creating a compound index with timestamp first and position second, however, the current mongo release (2.4.6) will not utilize it in the desired manner: https://jira.mongodb.org/browse/SERVER-10801. – jribnik Sep 19 '13 at 20:07
  • The area in question is pretty big, and as the coordinates in the collection are random, there are many entries in it, so the timestamp index is more effective. When the area is small with few entries, the timestamp index has to scan through everything if there are fewer than 10 matches in that area. In that case the position-timestamp compound index is clearly fastest, returning in 2 milliseconds, whereas the timestamp-position index takes more than 2000 ms. I think I need (at least) different indexes for different kinds of areas. – hleinone Sep 20 '13 at 07:38
  • I think you managed to hit the nail on the head with your comment. If you're asking for the 10 most recent points in a fairly large area, the problem goes from "find all points in an area and then find the 10 most recent" to "iterate through the most recent entries and check whether they are in the area." The second one, as you said, is much, much faster if a significant portion of all points are in the polygon. – 3rf Sep 20 '13 at 19:57
  • Out of curiosity, can you run this type of query a few times without hint, then run explain on it to see which index the query optimizer is using? MongoDB is designed to test and pick the optimal index for your query so you don't have to think about this kind of thing (the feature could be better, but it still works most of the time). – 3rf Sep 20 '13 at 20:01
  • With the initial indexes, it picks the 2dsphere-timestamp compound index when I run it a couple of times without a hint on a fairly large area. So it didn't pick the optimal timestamp index. – hleinone Sep 23 '13 at 06:58
  • Any updates to this? I realize this is a fairly old question, just wondering if you found a solution that works for you? @hleinone – sthomps Oct 10 '17 at 19:03
  • @sthomps No, this was left unresolved when I left the project a few years ago. – hleinone Oct 11 '17 at 13:37

3 Answers


I came across this question while looking for a solution to something similar. Since it is an old question that has gone unanswered, I will try to explain, for the benefit of others facing this kind of situation, why the approaches mentioned are not ideal for the task at hand and how these queries can be fine-tuned.

In the first case, scanning that many items is completely normal. Let me try to explain why:

When MongoDB builds the compound index "position_2dsphere_timestamp_-1", it actually creates one B-tree to hold all of the geometries contained in the position key (in this case, Points), and for each distinct value in this B-tree, another B-tree is created to hold timestamps in descending order. What this means is that, unless your entries are very (and I mean very) close to each other, the secondary B-trees will hold just one entry each, and query performance will be almost the same as with an index on the position field alone, except that MongoDB can use the timestamp value from the secondary B-trees instead of bringing the actual document into memory to check the timestamp.

The same applies to the other scenario, where we build the compound index "timestamp_-1_position_2dsphere". It is quite unlikely that two entries are inserted at the same time at millisecond precision. So in this scenario, yes, we have our data sorted by the timestamp field, but then we have lots of other B-trees, each holding just one entry for each distinct timestamp value. Applying a geoWithin filter will therefore not perform well, since it has to check every entry until the limit is met.

So how can you make these kinds of queries perform well? Personally, I start by putting as many fields in front of the geospatial field as I can. But the main trick is to hold another field, let's say "createdDay", which holds a number at day precision. If you need more precision you may use hour-level precision instead, at the cost of performance; it all depends on your project's needs. Your index would look like this: {createdDay: -1, position: "2dsphere"}. Now every document created on the same day is stored and sorted in the same 2dsphere B-tree. MongoDB starts from the current day, as it is the greatest value in the index, and does an index scan on the B-tree holding the positions of the documents whose createdDay is today. If it finds at least 10 documents it stops and returns them; if not, it moves to the previous day, and so on. This method should greatly increase the performance in your case.
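
A minimal sketch of the idea, assuming the timestamp field is stored as a date; the "createdDay" field name is illustrative, and in production you would set it at insert time rather than backfilling it like this:

// Backfill a day-precision number derived from the existing timestamp
// (here: days since the Unix epoch).
db.randomcoordinates.find().forEach(function(doc) {
    var day = Math.floor(doc.timestamp.getTime() / 86400000);
    db.randomcoordinates.update({_id: doc._id}, {$set: {createdDay: day}});
});

// Day precision first, geometry second, as described above.
db.randomcoordinates.ensureIndex({createdDay: -1, position: "2dsphere"});

// The query filters on the area as before, but sorts on createdDay first,
// falling back to the full timestamp to order entries within the same day.
db.randomcoordinates.find({
    position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}
}).sort({createdDay: -1, timestamp: -1}).limit(10)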

I hope this helps in your case.

AJNeufeld
  • Based on this: "unless your entries are very, (I mean very) close to each other" — what happens if I approximate the position of the items in a secondary point property used to query large areas? – Paolo Sanchi May 08 '19 at 08:23
  • Well, it would definitely improve the performance at the cost of precision. But with this approach you would either lose a lot of precision or not gain enough of a performance boost. Instead, you could try using Google's S2 algorithm to build your own geo-indexes on top of MongoDB's B-trees; that way you would have absolute control over the precision and performance. – Anıl Şimşek May 14 '19 at 14:17

Have you tried using the aggregation framework on your dataset?

The query you want would look something like:

db.randomcoordinates.aggregate(
    { $match: {position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}},
    { $sort: { timestamp: -1 } },
    { $limit: 10 }
);

Unfortunately, the aggregation framework doesn't have explain in a production build yet, so you'll only find out whether it makes a big time difference by trying it. If you're fine building from source, it looks like it may be there as of late last month: https://jira.mongodb.org/browse/SERVER-4504. It also looks like it'll be in the dev build 2.5.3 that is scheduled for release next Tuesday (10/15/2013).
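
Once that ships, you should be able to ask the shell for the plan via an explain option on the aggregate call, something along these lines (syntax taken from the newer builds; the exact output format is not guaranteed to match find()'s explain):

db.randomcoordinates.aggregate(
    [
        { $match: {position: {$geoWithin: {$geometry: {type: "Polygon", coordinates: [[[1, 1], [1, 90], [180, 90], [180, 1], [1, 1]]]}}}}},
        { $sort: { timestamp: -1 } },
        { $limit: 10 }
    ],
    { explain: true }
);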

freethejazz

What should I do to get it perform predictably regardless of the area?

$geoWithin simply does not operate with Θ(1) efficiency. As I understand it, it operates with Θ(n) efficiency in the average case (considering the algorithm would at most need to check n points, and at minimum 10).

However, I would most definitely do some preprocessing on the coordinate collection to ensure that the most recently added coordinates are processed first, to give you a better chance of getting Θ(10) efficiency (and it sounds like this, in addition to using the position_2dsphere_timestamp_-1 index, would be the way to go)!

Some have suggested to use {timestamp: -1, position: "2dsphere"} index, so I tried that out as well, but it doesn't seem to perform well enough.

(Please see response to initial question.)

Additionally, the following may be of use!

Optimization Strategies for MongoDB

Hope this helps!

TL;DR you can fool with the indexes all you want, but you're not going to get any greater efficiency out of $geoWithin unless you rewrite it.

That being said, you could always just focus on optimizing index performance, and rewrite the function if you'd like!

Drew