
I'm running a $geoWithin query with a polygon of around 500 km², and it's taking a very long time to execute: anywhere between 30 seconds and 5 minutes. The collection is only 180k documents, and the polygon could be anywhere from 2 km² to 10,000 km². The server has around 4 GB of RAM. Running the query locally (to eliminate network lag) has no noticeable effect.

I have set up a 2dsphere index on the collection, and limited the fields to return only _id (for now), roughly as sketched below.
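
A minimal shell sketch of that setup (the collection name is taken from the stats further down; `query` stands in for the $geoWithin filter shown later):

    // Build the 2dsphere index on the geometry field.
    db.output_areas.createIndex({ geometry: "2dsphere" })

    // Run the query, projecting only _id to keep the result set small.
    var query = { /* the $geoWithin filter shown below */ }
    db.output_areas.find(query, { _id: 1 })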

This is what my documents look like:

{
  "_id" : ObjectId("..."),
  "geometry" : {
    "type" : "MultiPolygon",
    "coordinates" : [[...]]
  },
  "area_sq_m" : 6699.1309787227955894
}

Here are my indexes:

[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "db.output_areas"
    },
    {
        "v" : 1,
        "key" : {
            "geometry" : "2dsphere"
        },
        "name" : "geometry_2dsphere",
        "ns" : "db.output_areas",
        "2dsphereIndexVersion" : 2
    }
]

Here's my query:

{
    "geometry": {
        $geoWithin: {
            $geometry: {
                type: 'Polygon',
                coordinates: [[ [lng,lat], [lng,lat], [lng,lat] ...]]
            }
        }
    }
}
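
Since the coordinates above are elided, here's a runnable version with a small hypothetical ring; note that a GeoJSON Polygon ring must be closed, so the first and last coordinate pairs are identical:

    // Hypothetical coordinates for illustration only; the ring is closed
    // (first point repeated as the last point), as GeoJSON requires.
    db.output_areas.find({
        geometry: {
            $geoWithin: {
                $geometry: {
                    type: "Polygon",
                    coordinates: [[
                        [ -1.60, 53.75 ],
                        [ -1.40, 53.75 ],
                        [ -1.40, 53.85 ],
                        [ -1.60, 53.85 ],
                        [ -1.60, 53.75 ]
                    ]]
                }
            }
        }
    }, { _id: 1 })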

And here's the output from running explain():

{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "db.output_areas",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "geometry" : {
        "$geoWithin" : {
          "$geometry" : {
            "type" : "Polygon",
            "coordinates" : [...]
          }
        }
      }
    },
    "winningPlan" : {
      "stage" : "PROJECTION",
      "transformBy" : {
        "_id" : 1
      },
      "inputStage" : {
        "stage" : "KEEP_MUTATIONS",
        "inputStage" : {
          "stage" : "FETCH",
          "filter" : {
            "geometry" : {
              "$geoWithin" : {
                "$geometry" : {
                  "type" : "Polygon",
                  "coordinates" : [...]
                }
              }
            }
          },
          "inputStage" : {
            "stage" : "IXSCAN",
            "keyPattern" : {
              "geometry" : "2dsphere"
            },
            "indexName" : "geometry_2dsphere",
            "isMultiKey" : true,
            "direction" : "forward",
            "indexBounds" : {
              "geometry" : [
                "[\"2f0332301\", \"2f0332301\"]",
                "[\"2f03323011\", \"2f03323011\"]",
                "[\"2f033230111\", \"2f033230112\")",
                "[\"2f033230112\", \"2f033230112\"]",
                "[\"2f0332301120\", \"2f0332301121\")",
                "[\"2f0332301121\", \"2f0332301121\"]",
                "[\"2f03323011210\", \"2f03323011211\")",
                "[\"2f03323011211\", \"2f03323011212\")",
                "[\"2f1003230\", \"2f1003230\"]",
                "[\"2f10032300\", \"2f10032300\"]",
                "[\"2f100323000\", \"2f100323001\")"
              ]
            }
          }
        }
      }
    },
    "rejectedPlans" : [ ]
  },
  "serverInfo" : {
    "version" : "3.0.4"
  },
  "ok" : 1
}

This suggests an index is being used. If I try a smaller area the query does get faster, and with a larger area it gets slower.
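
The plain explain() output only shows the plan; to see how much work the index scan is actually doing, executionStats is more useful. Something like this (with `polygon` standing in for the ring above) would show totalKeysExamined and totalDocsExamined next to nReturned:

    // If totalDocsExamined is much larger than nReturned, the index is
    // selecting far more candidate documents than the polygon contains,
    // and most of the time goes into the FETCH + filter stage.
    db.output_areas.find({
        geometry: { $geoWithin: { $geometry: polygon } } // "polygon" as above
    }, { _id: 1 }).explain("executionStats")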

Here are my collection stats:

{
    "ns" : "db.output_areas",
    "count" : 181408,
    "size" : 3062445568,
    "avgObjSize" : 16881,
    "numExtents" : 22,
    "storageSize" : 3927183360,
    "lastExtentSize" : 1021497344,
    "paddingFactor" : 1,
    "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
    "userFlags" : 1,
    "capped" : false,
    "nindexes" : 2,
    "totalIndexSize" : 35606480,
    "indexSizes" : {
        "_id_" : 5894896,
        "geometry_2dsphere" : 29711584
    },
    "ok" : 1
}

I ran the db.setProfilingLevel(2) command, re-ran the query, then inspected the db.system.profile collection.

The first record is the actual query ("op": "query"), followed by 7 more entries with "op": "getmore", which I assume are fetching the rest of the data. Each operation returns 1,000 documents ("nreturned": 1000) and takes around 4,000 ms ("millis") on average.
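
For reference, the profiling steps described above were roughly:

    // Profile all operations, re-run the query, then list the slowest
    // entries recorded against this namespace.
    db.setProfilingLevel(2)
    db.system.profile.find({ ns: "db.output_areas" }).sort({ millis: -1 }).limit(10).pretty()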

I've read lots of questions where people complain about GeoJSON queries taking more than 2 seconds on collections of more than 1 million documents, so I'm obviously missing something simple.

royse41
  • This actually does seem to correlate to something I happen to be writing up right now, as I am well aware the current `$geoWithin` algorithm is not entirely optimal. In broad strokes, a good approach is to look at your "min" and "max" x,y vertices and first construct a bounding-radius query via [**`$geoNear`**](http://docs.mongodb.org/manual/reference/operator/aggregation/geoNear/) from the aggregation framework. With those results in hand, and with respect to the `maxDistance` modifier, you then filter the result set down to the required polygon via `$geoWithin`. – Blakes Seven Aug 09 '15 at 11:28
  • Sorry for multiple comments, but space allows. Aggregation pipelines allow multiple stages, so an "optimal" `$geoNear` works best on the index at first. Then you can restrict to the polygon with `$geoWithin` to filter out things not within the bounds. I would post an answer, but I'm still working out all of the math for a concrete example. The basic principle does seem to yield the best results (see the sketch below). – Blakes Seven Aug 09 '15 at 11:32
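
A hedged sketch of the two-stage pipeline the comments above describe (all values are hypothetical; the bounding circle would be derived from the polygon's min/max vertices):

    // Stage 1: $geoNear narrows candidates to a bounding circle using the
    // 2dsphere index; it must be the first pipeline stage, and it returns
    // only 100 documents by default, hence "num".
    // Stage 2: $geoWithin then filters precisely against the polygon.
    db.output_areas.aggregate([
        { $geoNear: {
            near: { type: "Point", coordinates: [ -1.5, 53.8 ] }, // polygon centre (example)
            distanceField: "dist",
            maxDistance: 15000, // bounding radius in metres (example)
            num: 200000,        // raise the default limit of 100
            spherical: true
        } },
        { $match: {
            geometry: {
                $geoWithin: {
                    $geometry: { type: "Polygon", coordinates: [ /* your ring */ ] }
                }
            }
        } },
        { $project: { _id: 1 } }
    ])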

1 Answer


This may not be a real answer, but it might solve the problem: remove the index and try again. The query will be slower for small polygons but faster for large polygons, compared with the execution time while the index is in place.
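
If you want to test that safely, a quick shell sketch (the index can be recreated afterwards; $geoWithin, unlike $near, does not require an index):

    // Drop the index, re-run the $geoWithin query and compare timings,
    // then recreate the index.
    db.output_areas.dropIndex("geometry_2dsphere")
    // ... re-run the query here ...
    db.output_areas.createIndex({ geometry: "2dsphere" })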

Stavros Koureas