3

If you have a double compound index { a : 1, b : 1}, it makes sense to me that the index won't be used if you query on b alone (i.e. you cannot "skip" a in your query). The index will however be used if you query on a alone.

However, given a triple compound index { a : 1, b: 1, c: 1} my explain command is showing that the index is used when you query on a and c (i.e. you can "skip" b in your query).

How can Mongo use an abc index on a query for ac, and how effective is the index in this case?

Background:

My use case is that sometimes I want to query on a,b,c and sometimes I want to query on a,c. Now should I create only 1 index on a,b,c or should I create one on a,c and one on a,b,c?

(It doesn't make sense to create an index on a,c,b because c is a multi-key index with good selectivity.)

Zaid Masud
  • 13,225
  • 9
  • 67
  • 88

2 Answers2

4

bottom line / tl;dr: Index b can be 'skipped' if a and c are queried for equality or inequality, but not, for instance, for sorts on c.

This is a very good question. Unfortunately, I couldn't find anything that authoritatively answers this in greater detail. I believe the performance of such queries has improved over the last years, so I wouldn't trust old material on the topic.

The whole thing is quite complicated because it depends on the selectivity on your indexes and whether you query for equality, inequality and/or sort, so explain() is your only friend, but here are some things I found:

Caveat: What comes now is a mixture of experimental results, reasoning and guessing. I might be stretching Kyle's analogy too far, and I might even be completely wrong (and unlucky, because my test results loosely match my reasoning).

It is clear that the index of A can be used, which, depending on the selectivity of A, is certainly very helpful. 'Skipping' B can be tricky, or not. Let's keep this similar to Kyle's cookbook example:

French
    Beef
        ...
    Chicken
        Coq au Vin
        Roasted Chicken
    Lamb
        ...
    ...

If you now ask me to find some French dish called "Chateaubriand", I can use index A and, because I don't know the ingredient, will have to scan all dishes in A. On the other hand, I do know that the list of dishes in each category is sorted through the index C, so I will only have to look for the strings starting with, say, "Cha" in each ingredient-list. If there are 50 ingredients, I will need 50 lookups instead of just one, but that is a lot better than having to scan every French dish!

In my experiments, the number was a lot smaller than the number of distinct values in b: it never seemd to exceed 2. However, I tested this only with a single collection, and it probably has to do with the selectivity of the b-index.

If you asked me to give you an alphabetically sorted list of all French dishes, though, I'd be in trouble. Now the index on C is worthless, I'd have to merge-sort all those index lists. I will have to scan every element to do so.

This reflects in my tests. Here are some simplified results. The original collection has datetimes, ints and strings, but I wanted to keep things simple, so it's now all ints.

Essentially, there are only two classes of queries: those where nscanned <= 2 * limit, and those that have to scan the entire collection (120k documents). The index is {a, b, c}:

// fast (range query on c while skipping b)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }});
// slow (sorting)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "c" : -1});
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "b" : -1}); 

// fast (can sort on c if b included in the query)
> db.Test.find({"a" : 43, "b" : 7887, "c" : { $lte : 45454 }}).sort({ "c" : -1});

// fast (older tutorials claim this is slow)
> db.Test.find({"a" : {$gte : 43}, "c" : { $lte : 45454 }});

Your mileage will vary.

mnemosyn
  • 45,391
  • 6
  • 76
  • 82
  • Thank you, this is probably the best information we can get given the (well noted) lack of authoritative references. In my specific case B is a low selectivity column and I am doing a "begins with" query on C. So I suppose based on your answer my ABC index is almost just as good an AC index for a query on AC? I know mileage will vary so I might just go with both AC and ABC indexes for the time being... – Zaid Masud Jun 19 '12 at 08:48
  • I guess it will be almost as good as an AC-index as long as you don't sort on C, but `explain()` can tell you for sure. Keep an eye on `nscanned`, it shouldn't be much larger than `n`. – mnemosyn Jun 19 '12 at 09:03
1

You can view querying on A and C as a special case of querying on A (in which case the index would be used). Using the index is more efficient than having to load the whole document.

Suppose you wanted to get all documents with A between 7 and 13, and C between 5 and 8.

If you had an index on A only: the database could use the index to select documents with A between 7 and 13 but, to make sure that C was between 5 and 8, it would have to retrieve the corresponding documents too.

If you had an index on A, B, and C: the database could use the index to select documents with A between 7 and 13. Since the values of C are already stored in the records of the index, it could determine whether the correponding documents also match the C criterion, without having to retrieve those documents. Therefore, you would avoid disk reads, with better performance.

Eduardo
  • 8,362
  • 6
  • 38
  • 72
  • In other words, are you saying that when you query on AC then an ABC index is no better than an index on A alone? – Zaid Masud Jun 18 '12 at 08:50
  • It is better because it can filter out the unwanted values in C from the index itself, without having to read the document. – Eduardo Jun 18 '12 at 13:50
  • -1 @Eduardo I don't think I am following much of your reasoning here. How can it use the C part of the index? It's pretty misleading to think about the objective of an index to not "load the whole document." In fact, you should be thinking about it in terms of the number of documents that need to be scanned (nscan). – Zaid Masud Jun 18 '12 at 14:57
  • If your query was on A and C, and you only had an index on A, then the database would have to read all the documents where A matched the query, to be able to determine if their C value matched the query too. Since your index will typically be much smaller than the collection, using the index alone (with A and C) to determine which documents match the query will result in less disk reads and therefore better performance. – Eduardo Jun 18 '12 at 15:52
  • Sorry but this is simply not how mongo indexes work. Please read http://kylebanker.com/blog/2010/09/21/the-joy-of-mongodb-indexes/. Since C is ordered in groupings of B and is not sorted on its own, in your answer you need to show how C is usable. – Zaid Masud Jun 18 '12 at 16:38
  • It does not matter if the C part of the index is not sorted on its own. The main search is done on A. If the A part matches then the C part has to be analyzed. To analyze C, it is better to have it in the index than to have to read the corresponding document. – Eduardo Jun 18 '12 at 19:52