6

I'm running many queries of this type:

db.mycollection.find({a:{$gt:10,$lt:100}, b:4}).sort({c:-1, a:-1})

What sort of index should I use to speed it up? I think I'll need both {a:1, b:1} and {c:-1, a:-1}, am I right? Or will these indexes somehow interfere with each other, with no performance gain?

EDIT: The actual problem for me is that I run many queries in a loop, some of them over a small range, others over a large one. If I put an index on {a:1, b:1}, it selects small chunks very quickly, but when it comes to a large range I see the error "too much data for sort() with no index". If, instead, I put an index on {c:-1, a:-1}, there is no error, but the smaller chunks (and there are more of those) are processed much more slowly. So, how is it possible to keep the speed of selection for the smaller ranges, but not get an error on large amounts of data?

If it matters, I run queries through Python's pymongo.
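For reference, this is roughly how the query above is spelled in pymongo. One pitfall worth noting: a compound sort must be passed as a list of (field, direction) pairs, not a dict, so that the key order c-then-a is preserved. A minimal sketch; `db.mycollection` is assumed to be a live collection handle, so the actual find() calls appear only as comments:

```python
# Filter and sort specs in the shape pymongo expects.
query = {"a": {"$gt": 10, "$lt": 100}, "b": 4}

# IMPORTANT: a compound sort is a list of (field, direction) pairs;
# a plain dict may not preserve the key order c-then-a.
sort_spec = [("c", -1), ("a", -1)]

# With a live connection this would be:
#   cursor = db.mycollection.find(query).sort(sort_spec)
# and, to force a specific index:
#   cursor = db.mycollection.find(query).sort(sort_spec).hint([("c", -1), ("a", -1)])
```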

sashkello
  • Have you tried what Sammaye or I suggested in our answers, and was it useful/helpful? As you write in the bounty description: is there something on which we should expand our answers? If so, please ask and we will proceed. – attish Oct 06 '13 at 10:30
  • If you can share the real values behind `A`,`B`,`C`, it will help in giving a more specific answer. If for example `A` is `ISODate` and `B` is a flag that is not found in all the documents than I can easily optimize your query. – Oran Oct 07 '13 at 14:13
  • I don't know why you accepted Oran's answer, it is completely wrong; attish's is the best you can hope for. Even with Oran's second solution you will still get either scanAndOrder or an unindexed find – Sammaye Oct 08 '13 at 07:18
  • @Sammaye Well, every answer here is flawed in some way. I selected that one because it pushed me in the right direction of remodeling the way I query data. In the end I moved away from the query I initially intended and use several separate queries instead, with hint() on the sort() index when I know the result is going to be huge and hint() on the find() index when the result is small. I upvoted all the answers with good effort, but in the end Oran's ideas are the ones that made it work for me (yes, they are not entirely correct, but neither are the other answers). – sashkello Oct 08 '13 at 08:15
  • The answerer did not say any of that; it is what you did yourself. May I ask that you write up your own answer (copy-paste the text above) and accept that instead? It is really bad that a completely wrong answer, with lots of technical mistakes and clear misunderstandings of the technology, is accepted; one may think it is useful. – attish Oct 08 '13 at 10:04

5 Answers

4

If you had read the documentation you would have seen that using two indexes here is useless, since MongoDB only uses one index per query (unless it is an $or) until https://jira.mongodb.org/browse/SERVER-3071 is implemented.

Not only that, but when using a compound sort, the order in the index must match the sort order for an index to be used correctly. As such:

Or these indexes will somehow interfere with each other at no performance gain?

If intersectioning were implemented, no, they would not; but {a:1,b:1} does not match the sort, and {c:-1,a:-1} is sub-optimal for answering the find(), plus a is not a prefix of that compound.

So a first iteration of an optimal index would be:

{a:-1,b:1,c:-1}

But this isn't the full story. Since $gt and $lt are actually ranges, they suffer the same problem with indexes as $in does; this article should provide the answer: http://blog.mongolab.com/2012/06/cardinal-ins/ (I don't really see any reason to repeat its content).
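The sort-matching rule mentioned above (the sort keys must form a prefix of the index keys, with directions either all matching or all flipped, since an index can be walked backwards) can be sketched as a small checker. This is only an illustration of the rule in Python, not MongoDB's actual query planner:

```python
def sort_can_use_index(sort_spec, index_spec):
    """True if sort_spec is a prefix of index_spec with directions
    either all identical or all reversed (a fully flipped prefix
    still serves the sort, because the index can be walked backwards)."""
    if len(sort_spec) > len(index_spec):
        return False
    prefix = index_spec[:len(sort_spec)]
    same = all(s == p for s, p in zip(sort_spec, prefix))
    flipped = all(sf == pf and sd == -pd
                  for (sf, sd), (pf, pd) in zip(sort_spec, prefix))
    return same or flipped

# The question's sort {c:-1, a:-1}:
wanted_sort = [("c", -1), ("a", -1)]

print(sort_can_use_index(wanted_sort, [("a", 1), ("b", 1)]))              # False
print(sort_can_use_index(wanted_sort, [("c", -1), ("a", -1), ("b", 1)]))  # True
print(sort_can_use_index(wanted_sort, [("c", 1), ("a", 1)]))              # True (walked backwards)
```

The first case is why {a:1,b:1} alone triggers "too much data for sort() with no index" on large results: no prefix of it can serve the (c, a) sort.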

Sammaye
  • Shouldn't a be last in the index? The link was very helpful for us; it brought the time down from thousands of milliseconds to a few. You will probably have to use hint() if the query is not using the right index. – titogeo Sep 25 '13 at 07:58
  • @Sammaye Is it the general rule to have elements by which you sort by in reverse order? That is, I sort by (c, a) and in your suggestion a goes before c. – sashkello Sep 26 '13 at 04:46
  • I actually don't get why if there is only one index per query I had a visible improvement after having BOTH indexes implemented. Query wouldn't even go through without the index on sorting fields ("too much data for sort() with no index"). After I added it, each of my queries took about 10 seconds. Then I added index on a and b and it went down to 1 second. I don't get it... – sashkello Sep 26 '13 at 04:54
  • OK, so, if I do it like you suggest it doesn't recognize index for sort(). – sashkello Sep 26 '13 at 05:33
  • @sashkello I think, even though it is sub-optimal, you have to put a and b first because of the compound find. And yes, it doesn't recognise the sort because of c; I wrote this a bit quickly and didn't truly think it through. The index on a and b will help because MongoDB can sort by a and then do an in-memory sort by c without too many problems. Hmm, that might be the index you're looking for – Sammaye Sep 26 '13 at 07:03
  • @sashkello of course with index intersectioning this will get a lot easier to index – Sammaye Sep 26 '13 at 08:01
  • Based on the article that you linked, the best index is {c:-1,a:-1,b:1}, as I mentioned. I have updated my answer, thanks for the source – attish Oct 02 '13 at 14:34
  • For those upvoting: @attish's answer is actually better and provides maybe the only index that will currently work, sub-optimally, but still – Sammaye Oct 04 '13 at 15:21
  • Never mind, hopefully those who really need help will read through the answers. – attish Oct 04 '13 at 21:38
  • If you follow the Jira link in the answer, there is an update: Mongo now uses up to 2 indexes at the same time, not just one. `Index intersection is on by default. The query planner currently limits itself to 2 indices at a time, though this limit may increase in the future.` https://jira.mongodb.org/browse/SERVER-3071 – Lukas Liesis Sep 21 '16 at 05:08
4

Disclaimer: For MongoDB v2.4

Using hint is a nice solution, since it will force the query to use indexes that you chose, so you can optimize the query with different indexes until you are satisfied. The downside is that you are setting your own index per request. I prefer to set the indexes on the entire collection and let Mongo choose the correct (fastest) index for me, especially for queries that are used repeatedly.

You have two problems in your query:

  • Never sort on params that are not indexed. You will get the error "too much data for sort() with no index" if the amount of data returned by your .find() is very big; the exact limit depends on the version of Mongo that you use. This means that you must have indexes on A and C in order for your query to work.
  • Now for the bigger problem. You are performing a range query ($lt and $gt on param A), which can't work well with Mongo here: MongoDB only uses one index at a time, and you are putting two conditions on the same parameter. There are several ways to deal with this in your code:

    1. r = range( 11,100 )
      db.mycollection.find({a:{$in: r }, b:4}).sort({c:-1, a:-1})

    2. Use only $lt or $gt in your query,
      db.mycollection.find({ a: { $lt:100 }, b:4}).sort({c:-1, a:-1})
      Get the results and filter them in your Python code. This solution will return more data, so if you have millions of documents with a less than 11, don't use it!
      If you choose this option, make sure you use a compound key with A and B.
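A sketch of both options in Python. Note that on Python 3 the `$in` list should be materialized with `list(range(...))`, since a bare `range` object is not a BSON-serializable type; the sample documents below stand in for a real pymongo cursor:

```python
# Option 1: turn the open interval 10 < a < 100 into an explicit
# $in list of the integers 11..99.
in_values = list(range(11, 100))
query_in = {"a": {"$in": in_values}, "b": 4}

# Option 2: query with only the upper bound ($lt:100) and filter
# out a <= 10 client-side. These sample docs play the cursor's role.
cursor = [{"a": 5, "b": 4}, {"a": 50, "b": 4}, {"a": 99, "b": 4}]
filtered = [doc for doc in cursor if doc["a"] > 10]
```

Option 1 only makes sense when a takes a small set of discrete (e.g. integer) values; option 2 trades extra transferred data for index friendliness, as the warning above says.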

Pay attention when using $or in your queries, since $or is less efficiently optimized than $in in its usage of indexes.

Oran
  • I had a similar problem with indexes when using `$gte` and `$lt` with timestamps for my queries. Using only one and then filtering after mongo returned results is a great idea. – Lix Oct 07 '13 at 14:12
  • Can you explain a bit why it helps to use only $gt or $lt instead of both in the same query, and what the reason is that the range-based $in makes the query faster? – attish Oct 07 '13 at 15:34
  • Using hint isn't really a good solution since it demands persistence of coding changes in your application. For example, what if you want to take advantage of index intersectioning in MongoDB? You would have to rewrite all hinted queries to use a different hint; this breaks the KISS and DRY paradigms that help keep the sanity of programmers. Also, hint doesn't set the index, it merely sets the index usage for that query; you still define indexes on the collection level – Sammaye Oct 07 '13 at 16:22
  • BTW $or CAN USE INDEXES, it uses an index per clause – Sammaye Oct 07 '13 at 16:24
  • The sort error is nothing to do with ~100K rows, it is about 32 MB of memory being used – Sammaye Oct 07 '13 at 16:29
  • Also, regarding the range solutions: `1` makes no sense, as it produces the same problem, and `2` is bad – Sammaye Oct 07 '13 at 16:32
  • `1` is using indexes, so it makes sense in some cases, e.g. if `a` uses only int values, or `a` is a time field and we need to get specific days. `2` is an awesome solution that solved the range issue, since only one index can be used at a time. Obviously it depends on the data in the collection and specific needs. In my case I needed a range query that includes all the data between yesterday and the last 7 or last 30 days. That solution gave me a 500% performance boost compared to the full range query. Both solutions were discussed with 10gen (the MongoDB dev team), but it depends on the collection. – Oran Oct 07 '13 at 20:14
  • Do you have some articles to suggest for the 2nd case? I still don't see why a single $gt can utilize an index while an interval query cannot use the same index, if both constraints are on the same field. In a B-tree index such as MongoDB uses, there should not be a difference – attish Oct 07 '13 at 20:22
  • I edited my answer to be more accurate. The [sorting issue is blocked by ~32MB](http://docs.mongodb.org/manual/reference/method/cursor.sort/), as @Sammaye mentioned. That is true, if you use and updated version of MongoDB 2.2+. The older versions supports a smaller size. – Oran Oct 07 '13 at 20:29
  • @attish Regarding interval indexes, "[A single query can only use one index](http://docs.mongodb.org/manual/core/index-types/), **except** for queries that use the $or operator that can use a different index for each clause." I recommend reading the MongoDB [index concepts](http://docs.mongodb.org/manual/core/indexes/) to master query optimization. – Oran Oct 07 '13 at 21:29
  • @attish Use explain() to test the solution and you will see that it works! The idea was raised at the last MongoDB conference in my country, by MongoHQ. Using $gt and $lt in the same query is not efficient; it is better to use only $lt and loop on the cursor until we reach the $gt value. [Check out this article](http://blog.mongohq.com/mongodb-indexing-best-practices/) – Oran Oct 07 '13 at 21:32
  • Index usage is always upper- and lower-bounded, so a $gt:20 is really $gt:20, $lt:MaxKey; if you check the explain output I attached you will see it. (Actually, using given bounds speeds things up rather than slowing them down; check the code if you don't believe it: https://github.com/mongodb/mongo/blob/master/src/mongo/db/query/index_bounds.cpp) What the article suggests is to prepare a day representation of a timestamp (without the time part) as a bucket, to avoid interval querying at all. (That would not be a bad idea, but it is not useful here, and anyway you have not suggested it at all.) – attish Oct 07 '13 at 22:35
  • I guarantee that neither of those queries can make proper use of a single compound index; you will get either an unindexed find or scanAndOrder – Sammaye Oct 08 '13 at 07:16
  • That is why my answer is incomplete: I realised that there is no "good" way of actually doing this – Sammaye Oct 08 '13 at 07:23
  • Both queries are working for me on 1TB of data and I do use compound key for the second scenario. It's a **creative solution** that was presented to 10gen and they liked it. The second solution was tested by a team of developers with **many** stress tests, so saying that it will not work is irresponsible. I do agree that it's a **complex issue** and there are different possible answers depends on the data and specific needs which are different for each developer. – Oran Oct 08 '13 at 14:45
  • When was it presented to mongodb Inc (they are no longer 10gen)? I have written 100's nay 1000's of these queries in the last 5 years of using MongoDB so I wouldn't mind seeing some of those stress tests myself...it would go against EVERYTHING that MongoDB Inc employees and my own knowledge states – Sammaye Oct 08 '13 at 14:51
  • Also I have noticed you never actually state an index in its full form in your answer which makes me wonder what index you would apply – Sammaye Oct 08 '13 at 14:59
  • I didn't mention the indexes on purpose, since the `a`,`b`,`c` fields are not clear enough. Moreover I don't have the sharding key that is being used, assuming that there is more than one shard. I use `a` as time in ISODate - mandatory field in my app, so my compound index includes `key : { "time" : 1, "b" : 1 }`, assuming that `b` needs to be indexed. – Oran Oct 08 '13 at 15:52
  • It is still 10gen, with a new name, so what's the difference? Veteran developers who worked with mongo for years still remember the old name, since the name change was on August 27, 2013. You can still see a reference to the old name on their [home page](http://www.mongodb.com/press/10gen-announces-company-name-change-mongodb-inc). – Oran Oct 08 '13 at 16:02
2

If you define an index {c:-1,a:-1,b:1}, it will help, with some considerations.

With this option the index will be scanned in full, but based on the index values only the appropriate documents will be visited, and they will be visited in the right order, so no ordering phase is needed after getting the results. If the index is huge I do not know how it will behave, but I assume that when the result set is small it will be slower, and when the result set is big it will be faster.

About prefix matching: if you hint the index and its lower levels are usable to serve the query, those levels will be used. To demonstrate this behaviour I made a short test.

I prepared test data with:

> db.createCollection('testIndex')
{ "ok" : 1 }
> db.testIndex.ensureIndex({a:1,b:1})
> db.testIndex.ensureIndex({c:-1,a:-1})
> db.testIndex.ensureIndex({c:-1,a:-1,b:1})
> for(var i=1;i++<500;){db.testIndex.insert({a:i,b:4,c:i+5});}
> for(var i=1;i++<500;){db.testIndex.insert({a:i,b:6,c:i+5});}

The result of the query with hint:

> db.testIndex.find({a:{$gt:10,$lt:100}, b:4}).hint('c_-1_a_-1_b_1').sort({c:-1, a:-1}).explain()
{
    "cursor" : "BtreeCursor c_-1_a_-1_b_1",
    "isMultiKey" : false,
    "n" : 89,
    "nscannedObjects" : 89,
    "nscanned" : 588,
    "nscannedObjectsAllPlans" : 89,
    "nscannedAllPlans" : 588,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 1,
    "indexBounds" : {
        "c" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ],
        "a" : [
            [
                100,
                10
            ]
        ],
        "b" : [
            [
                4,
                4
            ]
        ]
    },
    "server" :""
}

Explanation of the output: the whole index is scanned, which is why nscanned is 588 (the number of scanned index entries and documents), while nscannedObjects is the number of scanned documents. So based on the index, Mongo reads only those documents which match the criteria (the index partially covers the query). As you can see, scanAndOrder is false, so there is no sorting phase. (That implies that if the index is in memory, this will be fast.)

In line with the article others linked (http://blog.mongolab.com/wp-content/uploads/2012/06/IndexVisitation-4.png): you have to put the sort keys first in the index and the query keys after them; if they share a subset, you have to include the subset in the very same order as in the sorting criteria (while the order does not matter for the query part).
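The scanAndOrder:false behaviour can be mimicked with a toy simulation: store entries in {c:-1,a:-1,b:1} order, scan them all, keep only the matches, and observe that the output is already in the requested sort order, with no separate sort phase. This is only an illustration in Python, not MongoDB internals; the data mirrors the test collection above:

```python
# Toy index: one entry per document, kept in (c desc, a desc, b asc)
# order, as the index {c:-1, a:-1, b:1} would store them.
docs = [{"a": a, "b": b, "c": a + 5}
        for a in range(1, 501) for b in (4, 6)]
index_entries = sorted(docs, key=lambda d: (-d["c"], -d["a"], d["b"]))

# Full index scan with filtering; matching docs come out already
# ordered by (c desc, a desc), so no post-sort is needed.
matches = [d for d in index_entries if 10 < d["a"] < 100 and d["b"] == 4]

assert matches == sorted(matches, key=lambda d: (-d["c"], -d["a"]))
print(len(matches))  # 89, matching n in the explain output above
```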

attish
  • I think whichever way you go this is going to cause a problem: your index provides a good value for scanAndOrder, while an alternate index runs much less optimally, but that find just cannot be indexed properly. I also realised that sorted cardinal ranges require scanning the entire index (as you said), which might nerf a lot of speed here, especially with also having to pick a or b from the documents themselves – Sammaye Oct 02 '13 at 15:52
  • Oh no, I misread the explain output; it isn't picking out documents it doesn't need from disk – Sammaye Oct 02 '13 at 20:35
  • It reads from disk only the appropriate docs which match the criteria, but it does not leverage the structure of the index; it will scan the whole index anyway. If the result set is small, {a:1, b:1} will be much faster. – attish Oct 02 '13 at 22:26
  • Yeah, I misread the explain on that first bit. Hmm, though, scanning the entire index defeats the point of the index; personally this does seem like a weak spot of MongoDB itself, hopefully something fixed with intersectioning – Sammaye Oct 02 '13 at 23:32
0

I think it will be better to change the order of the fields in find.

db.mycollection.find({b:4, a:{$gt:10,$lt:100}}).sort({c:-1, a:-1})

and then you add an index

{b:1,a:-1,c:-1}
Katja
  • Changing the order of the fields in find has no effect on anything, why would you think it would? It is the order of the index that matters – Sammaye Oct 04 '13 at 07:57
  • @Sammaye - I thought it is faster to search for exact value and then the range. Am I wrong? – Katja Oct 04 '13 at 08:24
  • That's about index order not find order; the MongoDB query optimiser will actually completely change your find order to match the index order – Sammaye Oct 04 '13 at 08:41
0

I tried two different indexes,

one with index in the order of db.mycollection.ensureIndex({a:1,b:1,c:-1})

and the explain plan was like below

{
    "cursor" : "BtreeCursor a_1_b_1_c_-1",
    "nscanned" : 9542,
    "nscannedObjects" : 1,
    "n" : 1,
    "scanAndOrder" : true,
    "millis" : 36,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "a" : [
            [
                3,
                10000
            ]
        ],
        "b" : [
            [
                4,
                4
            ]
        ],
        "c" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    }
}

and the other index with db.mycollection.ensureIndex({b:1,c:-1,a:-1})

> db.mycollection.find({a:{$gt:3,$lt:10000},b:4}).sort({c:-1, a:-1}).explain()
{
    "cursor" : "BtreeCursor b_1_c_-1_a_-1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 8,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "b" : [
            [
                4,
                4
            ]
        ],
        "c" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ],
        "a" : [
            [
                10000,
                3
            ]
        ]
    }
}
> 

Since you are querying 'a' over a range of values and 'b' for a specific value, I believe the second option is more appropriate: nscanned dropped from 9542 to 1.

Srivatsa N