
Consider a search panel that includes multiple options, like in the picture below:

[screenshot: a search panel with multiple filter fields]

I'm working with MongoDB and created a compound index on 3-4 properties in a specific order. But when I run different combinations of searches, I see a different order in the execution plan (explain()) every time. Sometimes it ends up as a collection scan (COLLSCAN, bad), and sometimes it fits the index (IXSCAN).

The selective fields that should be handled by MongoDB indexes are: brand, Types, Status, Warehouse, Carries, and Search (by id only).

My question is:

Do I have to create every combination of all fields in different orders? That could be 10-20 compound indexes. Or 1-3 big compound indexes, but again that won't solve the ordering problem.

What is the best strategy to deal with a big variety of field combinations?

I use the same query structure with different combinations of fields:

// Example Query. 
// Fields can be different every time according to the user's selection (and order)!

 db.getCollection("orders").find({
  '$and': [
    {
      'status': {
        '$in': [
          'XXX',
          'YYY'
        ]
      }
    },
    {
      'searchId': {
        '$in': [
          '3859447'
        ]
      }
    },
    {
      'origin.brand': {
        '$in': [
          'aaaa',
          'bbbb',
          'cccc',
          'ddd',
          'eee',
          'bundle'
        ]
      }
    },
    {
      '$or': [
        {
          'origin.carries': 'YYY'
        },
        {
          'origin.carries': 'ZZZ'
        },
        {
          'origin.carries': 'WWWW'
        }
      ]
    }
  ]
}).sort({"timestamp":1})
// My compound index is:
{ status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1 }

But that is only 1 combination... there could be plenty, like:

a. { status: 1 }
b. { status: 1, searchId: -1 }
c. { status: 1, searchId: -1, "origin.brand": 1 }
d. { status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1 }
...

Additionally, what will happen to write/read performance? I think write performance will decrease relative to reads...

The query patterns are:

1. find(...) with $and/$or + sort

2. Aggregation with $match/$sort (sketched below)
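
For pattern 2, a minimal sketch of the aggregation form (reusing the hypothetical filter values from the find() example above); a $match + $sort at the start of the pipeline can be served by the same indexes as the equivalent find() + sort():

// Aggregation form of the same search; a leading $match + $sort
// can use the same compound indexes as the find() version.
db.getCollection("orders").aggregate([
  { $match: {
      'status': { '$in': ['XXX', 'YYY'] },
      'origin.brand': { '$in': ['aaaa', 'bbbb', 'cccc'] }
  } },
  { $sort: { 'timestamp': 1 } }
])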

thanks

VitalyT
  • The best strategy is to use indexes that support your queries. It's not always possible in real life, though. It appears you are asking about one of such cases, so you have to compromise somewhere. That's where details matter - computational resources, db size, query patterns, other constraints. With such a generic question, the answers are gonna be equally generic - https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes and https://docs.mongodb.com/manual/core/aggregation-pipeline/#aggregation-pipeline-operators-and-performance – Alex Blex Jun 08 '20 at 13:41
  • As I said, my queries have the same structure and can be any of these permutations of what the user can search... like: types/brand/warehouse/etc... – VitalyT Jun 08 '20 at 14:04
  • Dear Vitaly, I trust the image clearly shows the problem to those who are familiar with the business requirements behind the screenshot. If you want to reach a wider audience, I encourage you to elaborate a little bit here - what are the flags, how many sorts, relevant document structure, etc. The phrase "it can be 10-20 compound or 1-3 big compound indexes" makes no sense without the details that you seem shy to share. Can you give a hint what indexes you have in mind? – Alex Blex Jun 08 '20 at 14:21
  • To give you an idea: what I see on the image looks like a filter by 10 fields + free text search. Even ignoring the free search and sorting order, 2^10 gives 1024 different combinations of filters. Half of them can be covered by index prefixes, which still leaves 511 compound indexes + 1 simple index. Even a tenth of that will kill your database on the first inserts, so you really need to think about which indexes to create. For that you need to know the cardinality of the data and the query patterns - the more frequent queries should have better indexes. – Alex Blex Jun 08 '20 at 17:05
  • Thanks @Alex Blex. Let's eliminate free search (goes to ES) and flags. I mean only the select boxes (closed groups). I need only the compound index (brand, Types, Status, Warehouse, Carries). There could be combinations. I've put an example query. – VitalyT Jun 08 '20 at 18:53
  • Updated my Question with examples – VitalyT Jun 08 '20 at 19:06

3 Answers


Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.

What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.

With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
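
A quick way to gauge selectivity is to count directly (a sketch; the "OK" status value is just the example from above):

// Fraction of documents matching the candidate value; if this is well
// above ~1%, an index on the field alone may not pay off.
var matching = db.orders.countDocuments({ status: "OK" })
var total = db.orders.estimatedDocumentCount()
print(matching / total)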

Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
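
For example, a hypothetical index for such a case ("origin.brand" is taken from the query in the question; "type" is a guess at the real field name, so adjust to the actual document shape):

// Compound index matching a popular, selective brand+type search.
db.orders.createIndex({ "origin.brand": 1, "type": 1 })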

Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
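
You can check which way the planner decides with explain(), e.g.:

// Shows whether a brand-only query uses the brand+type index (IXSCAN)
// or falls back to a collection scan (COLLSCAN).
db.orders.find({ "origin.brand": "aaaa" })
  .explain("executionStats").queryPlanner.winningPlan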

Continue in this manner with other popular queries and fields.

D. SM

So you have subdocuments, ranged queries, and sorting by 1 field only.

That eliminates most of the possible permutations, assuming there are no other surprises.

D. SM already covered selectivity - you should really listen to what the man says and at least upvote.

The other thing to consider is the order of the fields in the compound index:

  1. fields that have direct match like $eq
  2. fields you sort on
  3. fields with ranged queries: $in, $lt, $or etc
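
For example, applied to a hypothetical query with one equality filter, one range filter, and a sort ("qty" is an invented field for illustration):

// query: { status: "OK", qty: { $gte: 10 } }, sorted by { timestamp: 1 }
// index order: equality (status) -> sort (timestamp) -> range (qty)
db.orders.createIndex({ status: 1, timestamp: 1, qty: 1 })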

These are common rules for all b-trees. Now things that are specific to mongo:

A compound index can include at most one multikey field - a field reached through an array, like "origin.brand" when origin is an array of subdocuments. Again, I assume origins are embedded docs, so the document's shape is like this:

{
    _id: ...,
    status: ...,
    timestamp: ....,
    origin: [
        {brand: ..., carries: ...},
        {brand: ..., carries: ...},
        {brand: ..., carries: ...}
    ]
}

For your query the best index would be

{
  searchId: 1,
  timestamp: 1,
  status: 1, /** only if it is selective enough **/
  "origin.carries" : 1 /** or brand, depending on data **/
}
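
In createIndex form (a sketch of the same suggestion):

db.orders.createIndex({
  searchId: 1,
  timestamp: 1,
  status: 1,            // only if it is selective enough
  "origin.carries": 1   // or "origin.brand", depending on data
})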

Regarding the number of indexes - it depends on the data size. Ensure all indexes fit into RAM, otherwise things will be really slow.
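
You can check the current footprint from the shell, for example:

// Total size of all indexes on the collection, in bytes; compare to RAM.
db.orders.totalIndexSize()
// Per-index breakdown:
db.orders.stats().indexSizes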

Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources, you should keep an eye on the database. Check slow queries to add new indexes, and collect stats on users' queries to remove unused indexes and free up some room. Basically, apply common sense.
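
The profiler and $indexStats are two built-in tools for that (a sketch):

// Log queries slower than 100 ms to system.profile for later review.
db.setProfilingLevel(1, { slowms: 100 })
// Per-index usage counters since server start; indexes whose
// accesses.ops stays at 0 are candidates for removal.
db.orders.aggregate([{ $indexStats: {} }])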

Alex Blex
  • Thanks for your response - but it's still not clear. For that particular query you built 1 compound index. What happens if the user selects other cases and the query is different, including the order? Should I build another compound index with a different order (possibly even the same fields)? I see in the execution plan that the order of fields is not constant. Should I build the index according to the appearance in explain() or according to my query? BTW - why do you start the index with searchId and not with type in my case? Thanks :):) – VitalyT Jun 09 '20 at 16:32
  • You know, once upon a time there was a guy called Leo Tolstoy who spent some 20 years writing the monumental novel "War and Peace". When a young lady asked him what the book was about, the poor fella was stuck for an answer. After a very long pause he couldn't come up with anything better than: "I beg your pardon, but it would be simpler to write it once again than to summarise it in a few words". – Alex Blex Jun 09 '20 at 17:15
  • You see, a good half of my answer describes in which order to put fields in the compound index, and I doubt I can explain it better. The follow-up questions clearly show that the message was not fully comprehended. Let's chalk it up to language barrier and communication problems. Try to read https://docs.mongodb.com/manual/indexes/ It is a bit of a longer read, but it explains in detail how mongo deals with compound indexes. – Alex Blex Jun 09 '20 at 17:21
  • The order of fields in your indexes should be as in the list in my answer after the words "the order of the fields in the compound index:". searchId is the first field because it is an $in with a single item in the query. It is translated to a direct match `searchId: {$eq: ...}` and should go first. – Alex Blex Jun 09 '20 at 17:39
  • Got you. Last question: if I have another query which contains a combination of only 2 fields (e.g. origin.carries & origin.brand) - should I create another compound index, and so on... or rely on the previous ones that were already created? – VitalyT Jun 10 '20 at 06:40
  • Please learn how index prefixing works in mongo: https://docs.mongodb.com/manual/core/index-compound/#prefixes If you have a query with 2 fields and a compound index where these 2 fields are the first 2, it can be used - T&Cs apply: selectivity, ranges, multikeys, sorting. If any of these conditions makes the index less performant than the alternatives, the Mongo planner will reject it. – Alex Blex Jun 10 '20 at 08:14

I noticed this one-year-old topic because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.

Even worse: the user should indicate some value (or range) for the fields that make up the sharding key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter). When the user needs the liberty to search on other fields that are not necessarily the ones which make up the sharding key, we're stuck with a full-database search. Our database is some tens of TB in size... Indexes should fit in RAM? That can only be achieved with small databases, meaning some hundreds of GB max. What about my 37 TB database? Those indexes won't fit in RAM.

So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:

  1. we have a cluster with 108 shards, each containing 100 chunks
  2. at insert time, we take some fields which we know yield a good cardinality of the data, and we compute the sharding key from those fields; the document goes into the main collection (call it "Main_col") on that computed shard, i.e. with a certain chunk number (which equals our computed sharding-key value)
  3. from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one for each such field): that document contains the value of this crucial field, plus an array with the chunk number where the original full document has been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunk in collection "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
  4. when we insert another document that has the same value for some 'crucial' field "A", the array of chunk numbers in "Crucial_col_A" will be updated (with a 'merge') to contain the (different or same) chunk number of this next full document from "Main_col"
  5. a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion, for the crucial field (say field "B"), will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one per shardkey-value=chunk-number (or one per shard, to be decided) found in that array. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.

Thus this involves 2 query steps: first in the "Crucial_col_X" collection to obtain the array of chunk numbers where the full documents exist, and then the second query on those specific chunks in "Main_col". The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, and it runs very fast. The second (set of) queries is done with precise values for the sharding keys (= the chunk numbers), so these are expected to be very fast as well.
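
A sketch of those two steps in shell code (all collection and field names here are the hypothetical ones from the description above):

// Step 1: find the pointer document for crucial field "B"; Crucial_col_B
// is sharded on this value, so the query targets a single shard.
var ptr = db.Crucial_col_B.findOne({ value: "some-B-value" })

// Step 2: query Main_col restricted to those chunk numbers (the computed
// shard key), adding whatever extra criteria the user supplied.
db.Main_col.find({
  chunkNumber: { $in: ptr.chunks },  // hypothetical array of chunk numbers
  someOtherField: "extra criterion"  // optional further filtering
})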

This way of working would eliminate the burden of defining many index combinations.

rdelangh