6

I have a MongoDB collection named post with 35 million objects. The collection has two secondary indexes defined as follows.

> db.post.getIndexKeys()
[
    {
        "_id" : 1
    },
    {
        "namespace" : 1,
        "domain" : 1,
        "post_id" : 1
    },
    {
        "namespace" : 1,
        "post_time" : 1,
        "tags" : 1  // this is an array field
    }
]

I expect the following query, which simply filters by namespace and post_time, to run in a reasonable time without scanning all objects.

>db.post.find({post_time: {"$gte" : ISODate("2013-04-09T00:00:00Z"), "$lt" : ISODate("2013-04-09T01:00:00Z")}, namespace: "my_namespace"}).count()
7408

However, it takes MongoDB at least ten minutes to retrieve the result and, curiously, it manages to scan 70 million objects to do the job according to the explain function.

> db.post.find({post_time: {"$gte" : ISODate("2013-04-09T00:00:00Z"), "$lt" : ISODate("2013-04-09T01:00:00Z")}, namespace: "my_namespace"}).explain()
{
    "cursor" : "BtreeCursor namespace_1_post_time_1_tags_1",
    "isMultiKey" : true,
    "n" : 7408,
    "nscannedObjects" : 69999186,
    "nscanned" : 69999186,
    "nscannedObjectsAllPlans" : 69999186,
    "nscannedAllPlans" : 69999186,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 378967,
    "nChunkSkips" : 0,
    "millis" : 290048,
    "indexBounds" : {
        "namespace" : [
            [
                "my_namespace",
                "my_namespace"
            ]
        ],
        "post_time" : [
            [
                ISODate("2013-04-09T00:00:00Z"),
                ISODate("292278995-01--2147483647T07:12:56.808Z")
            ]
        ],
        "tags" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    },
    "server" : "localhost:27017"
}

The difference between the number of objects and the number of scans must be caused by the lengths of the tag arrays (which are all equal to 2). Still, I don't understand why post_time filter does not make use of the index.

Can you tell me what I might be missing?

(I am working on a descent machine with 24 cores and 96 GB RAM. I am using MongoDB 2.2.3.)

Zaid Masud
  • 13,225
  • 9
  • 67
  • 88
Eser Aygün
  • 7,794
  • 1
  • 20
  • 30
  • Has namespace got a very low cardinality? – Sammaye May 09 '13 at 11:20
  • Currently, there is only one distinct `namespace` value, which is the one I'm using. – Eser Aygün May 09 '13 at 11:20
  • Yea that is why, MongoDB must limit on the first field first, so it gets all `my_namespace` and then gets all docs between that date, etc etc, try reordering the index so the post_time is first – Sammaye May 09 '13 at 11:21
  • Doesn't the index tree contain `post_time` values too under the `my_namespace` branch? Why start scanning before narrowing down? – Eser Aygün May 09 '13 at 11:26
  • Just to be sure, I've created an index with key `{post_time: 1, namespace: 1}` as you suggested. It still uses the old index if I filter by both `namespace` and `post_time`. However, if I filter _only_ by `post_time`, it computes the result in a blink, as expected. – Eser Aygün May 09 '13 at 11:48
  • Hmm the old index must be winning the race condition, if you just try a `hint()` a sec does it all work as expected? – Sammaye May 09 '13 at 12:03
  • 2
    Even though the btree contains both fields compound indexs work in a certain way in MongoDB which means it will scan all namespaces for that value and then narrow down in date. Hmm trying to find a doc page that explains it well but there is no real good doc page on index internals from the google search though this post may help: http://emptysquare.net/blog/optimizing-mongodb-compound-indexes/ – Sammaye May 09 '13 at 12:06
  • I see your point and it makes sense if you are worried on the index optimality. What I was worried was that the index was not being used at all. Now, I've found the reason. Please see my answer. Thank you for the discussion and for the references. – Eser Aygün May 09 '13 at 12:23

1 Answers1

3

Found my answer in this question: Order of $lt and $gt in MongoDB range query

My index is a multikey index (on tags) and I am running a range query (on post_time). Apparently, MongoDB cannot use both sides of the range as a filter in this case, so it just picks the $gte clause, which comes first. As my lower limit happens to be the lowest post_time value, MongoDB starts scanning all the objects.

Unfortunately, this is not the whole story. Trying to solve the problem, I created non-multikey indexes too but MongoDB insisted on using the bad one. That made me think that the problem was elsewhere. Finally, I had to drop the multikey index and create one without the tags field. Everything is fine now.

Community
  • 1
  • 1
Eser Aygün
  • 7,794
  • 1
  • 20
  • 30
  • Dang I never knew that about `$gt` and `$lt` and miltikeys, nice find! – Sammaye May 09 '13 at 12:32
  • 1
    Using cursor.hint could also be a solution to make mongodb use the other index (http://docs.mongodb.org/manual/reference/method/cursor.hint/#cursor.hint) – rudi Sep 19 '14 at 14:49