
I have a relatively straightforward query, yet I can't seem to create the proper index to make it the most efficient read possible (or can't seem to instruct Mongo to use my index). The query is:

const query = {
  'location.geoJson': {
    $geoWithin: {
      $centerSphere: [
        user.location.geoJson.coordinates,
        defaultRadiusInMiles / earthRadiusInMiles,
      ],
    },
  },
  _id: { $lt: lastId },
};

results = collections.myCollection.find(query).sort({ _id: -1 }).limit(limit);
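For context on the `defaultRadiusInMiles / earthRadiusInMiles` division: `$centerSphere` expects its radius in radians, so miles must be divided by the earth's radius in miles. A minimal sketch of that conversion (the 3963.2-mile earth radius is the approximate value the MongoDB docs use; the 10-mile radius is a placeholder):

```javascript
// $centerSphere takes [centerCoordinates, radiusInRadians].
// Converting a distance in miles to radians means dividing by
// the earth's radius in miles.
const earthRadiusInMiles = 3963.2; // approximate earth radius, per MongoDB docs
const defaultRadiusInMiles = 10;   // placeholder search radius

const radiusInRadians = defaultRadiusInMiles / earthRadiusInMiles;
console.log(radiusInRadians); // ≈ 0.00252
```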

The index I've created in an attempt to make this query more efficient is:

collections.myCollection.createIndex({ 'location.geoJson': '2dsphere', _id: -1 })

However, when I review the explainStats, I see the following:

 "winningPlan": {
      "stage": "LIMIT",
      ...
      "inputStage": {
        "stage": "FETCH",
        "filter": {
          "location.geoJson": {
            "$geoWithin": {
              "$centerSphere": [
              ...
              ]
            }
          }
        },
        "inputStage": {
          "stage": "IXSCAN",
          "keyPattern": {
            "_id": 1
          },
          "indexName": "_id_",

According to the documentation, this indicates that Mongo is first doing an index scan on _id, then fetching based on location, and lastly limiting the results, which is not what I want.

So, is this because my compound index is incorrect and doesn't support this query? Or how can I force Mongo to use my compound index? Lastly, the queryPlanner shows that the parsed query is:

"parsedQuery": {
      "$and": [
        {
          "_id": {
            "$lt": ...
          }
        },
        {
          "location.geoJson": {
            "$geoWithin": {
              "$centerSphere": [
                ...
              ]
            }
          }
        }
      ]
    },

Maybe the _id portion is executed first because it is the first element in the $and array? Any help whatsoever is appreciated.

EDIT: I should probably provide some context. I am attempting to implement newest-to-oldest pagination, which is why I am passing an ObjectId to begin with. When I obtain the first batch of results (that is, the limit newest posts), I pass ObjectId('f'.repeat(24)) as lastId. This is when I observe the performance issues.

I know for a fact that I have 110 documents within that radius, and only when I obtain this first batch are all 110 documents scanned, even though limit < 110. What is interesting, however, is that when I obtain the next batch in the pagination (by passing the last ObjectId from the first batch), it does not scan all 110 documents and is very efficient: in that case it only examines limit documents and keys. Could anyone help make sense of why the first batch has performance issues but the second batch, with a specific ObjectId, does not?

Perhaps I should implement newest-to-oldest pagination some other way; I'm open to suggestions on that as well.
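In code, the pagination flow I'm describing looks roughly like this (the coordinates and radius below are placeholder values, not my real data):

```javascript
// Keyset ("seek") pagination, newest to oldest: each page is bounded by the
// last _id seen, so the very first page uses a maximum ObjectId sentinel.
function buildPageQuery(lastId, centerCoordinates, radiusInRadians) {
  return {
    'location.geoJson': {
      $geoWithin: { $centerSphere: [centerCoordinates, radiusInRadians] },
    },
    _id: { $lt: lastId },
  };
}

// First batch: sentinel "max" ObjectId (all f's), so every document matches.
const firstPage = buildPageQuery('f'.repeat(24), [-73.97, 40.77], 0.0025);

// Subsequent batches: the last _id returned by the previous page.
const nextPage = buildPageQuery('629bb42f1a47f5dfc2641329', [-73.97, 40.77], 0.0025);
```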

Here are the executionStats for when I obtain the first batch vs the second batch:

first batch

second batch

Isaac Torres

1 Answer


Change the order of the fields in your index:

collections.myCollection.createIndex({ _id: -1, 'location.geoJson': '2dsphere' })

or don't sort by _id

Mongo can sort only by an index prefix: https://www.mongodb.com/docs/v4.4/tutorial/sort-results-with-indexes/#sort-and-index-prefix. If you sort by _id and _id is not the first key in the index definition, it will require an expensive in-memory sort. The planner will try to avoid that at any cost.

You can try to force the existing index with https://www.mongodb.com/docs/v5.0/reference/method/cursor.hint/#mongodb-method-cursor.hint and check the execution stats.
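For example, a shell sketch (assuming the default name MongoDB generates for your 2dsphere/_id compound index; this needs a live connection, so adjust names to your setup):

```javascript
// Force the compound index and compare the resulting plan
// against the one the planner picks on its own.
collections.myCollection
  .find(query)
  .sort({ _id: -1 })
  .hint('location.geoJson_2dsphere__id_-1')
  .limit(limit)
  .explain('executionStats');
```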

UPDATE

So from the first batch we have:

"winningPlan": {
  "stage": "SORT",
  "sortPattern": {
    "_id": -1
  },
  "limitAmount": 10,
  "type": "simple",
  "inputStage": {
    "stage": "FETCH",
    "filter": {
      "location.geoJson": {...}
    },
    "inputStage": {
      "stage": "IXSCAN",
      "keyPattern": {
        "location.geoJson": "2dsphere",
        "_id": -1
      },
      "indexName": "location.geoJson_2dsphere__id_-1",
      "direction": "forward",
      .... 
      "indexBounds": {
        "location.geoJson": [ .... ],
        "_id": [
          "(ObjectId('ffffffffffffffffffffffff'), ObjectId('000000000000000000000000')]"
        ]
      }
    }
  }
},
"executionStats": {
  "executionTimeMillis": 6,
  "totalKeysExamined": 110,
  "totalDocsExamined": 110,

From the second one:

"winningPlan": {
  "stage": "LIMIT",
  "limitAmount": 10,
  "inputStage": {
    "stage": "FETCH",
    "filter": {
      "location.geoJson": { ... }
    },
    "inputStage": {
      "stage": "IXSCAN",
      "keyPattern": {
        "_id": 1
      },
      "indexName": "_id_",
      "direction": "backward",
      "indexBounds": {
        "_id": [
          "(ObjectId('629bb42f1a47f5dfc2641329'), ObjectId('000000000000000000000000')]"
        ]
      }
    }
  }
},
"executionStats": {
  "executionTimeMillis": 5,
  "totalKeysExamined": 10,
  "totalDocsExamined": 10,

The first batch uses the index location.geoJson_2dsphere__id_-1, fetches all 110 documents, sorts them by _id, and returns the first 10. It takes 6 ms.

The second batch uses the index _id_, fetches the top 10 that match the filter, and returns them. It takes 5 ms.

It's about selectivity. The filter in the first batch covers all _ids, so the _id index was rejected and the geo index was the next choice. In that index documents are sorted by location first, so it had to sort all of them by _id to get the top 10.

In the second batch it used the _id index to avoid sorting: the documents are already in order, so it just tested the top 10 that fit into the sphere and returned them.

It did not affect performance much, though: 6 ms vs 5 ms, so I wouldn't bother, really. Still, I believe an index {_id, location.geoJson} would fit better.

Alex Blex
  • Hello, thanks for your answer. I added an edit, could you check that out? Thanks :D – Isaac Torres Jun 03 '22 at 01:46
  • 1
    Well, it slightly changes the narrative. Answering "how to create an index to support the query" is all about the index definition and some explanations. Answering "why the other query is faster" is another matter. Please provide at least the other query and the explain() to compare. – Alex Blex Jun 03 '22 at 12:50
  • They're exactly the same query; the only difference is that when obtaining the first batch I pass `ObjectId('f'.repeat(24))`, whereas for the second query I pass a specific `ObjectId` (specifically, the last one obtained from the first batch). As for the `explainStats`, it's rather long; I'll post it, but is there anything specific that's critically important? – Isaac Torres Jun 04 '22 at 04:44
  • The results of explain for the slow and fast queries are the only source of information to analyse to answer the question "why". I agree with the expectation that performance shouldn't be affected much, and am wondering how mongodb explains this. You can post it as a gist, and I will include the important parts in the answer. Please use "executionStats" or "allPlansExecution" as the explain parameter. Without the explains I can only speculate that the distribution of your 110 documents is uneven, i.e. they were added first to a rather big collection. – Alex Blex Jun 04 '22 at 12:56
  • I've added the executionStats; hopefully that helps. Please let me know if you need anything else. – Isaac Torres Jun 04 '22 at 19:53
  • Sorry @IsaacTorres, I don't need anything. I was under impression that you needed some help instead =) Anyway, I fail to see any performance issues in the execution stats. See my update. – Alex Blex Jun 05 '22 at 00:00