
I'm looking for a tool to get a decent estimate of how large a MongoDB index will be based on a few signals like:

  • How many documents in my collection
  • The size of the indexed field(s)
  • The size of the _id I'm using if not ObjectId
  • Geo/Non-geo

Has anyone stumbled across something like this? I can imagine it would be extremely useful given Mongo's performance degradation once it hits the memory wall and documents start getting paged out to disk. If I have a functioning database and want to add another index, the only way I'll know if it will be too big is to actually add it.

It wouldn't need to be accurate down to the bit, but with some assumptions about B-Trees and the index implementation I'm sure it could be reasonable enough to be helpful.

If this doesn't exist already I'd like to build and open-source it, so if I've missed any required parameters for this calculation, please include them in your answer.

jpredham
  • It may be worthwhile to coincide your tool (to fill the gap in the interim) with a request for a built-in tool from the MongoDB team. – Derek Litz Dec 23 '11 at 15:32
  • Did you actually end up writing a tool for this? – Stennie Aug 23 '12 at 09:09
  • I did, however the results were less than satisfactory. When tested with real data with existing indexes for comparison, my tool would predict index sizes to be slightly less than twice the actual sizes. I'm investigating whether this is a bug in my code or if the formula is just *very* rough. Will update here when I find out more. – jpredham Aug 30 '12 at 21:37
  • @Stennie I might be pulling back an old thread. But is there an official comprehensive way to identify this? – Naman Feb 04 '21 at 14:53
  • 1
    @Naman Tyler's answer from 2011 describes the original MMAP storage engine circa MongoDB 2.0, but this formula definitely isn't applicable to modern versions of MongoDB. WiredTiger, the default storage engine in MongoDB 3.2+, uses index prefix compression so index sizes will vary based on the distribution of key values. There are also a variety of index types and options which might affect sizing. The best approach for a reasonable estimate would be using empirical estimation with representative test data for your projected growth, so I would put your upvotes on Ostati's answer from 2014. – Stennie Feb 05 '21 at 10:59
  • @Stennie thank you for pitching in... the actual collection for my use case might have `500 million` documents and above. How much data do you think would be sufficient for me to derive a correlation in terms of the actual index sizes that I could end up with? e.g. would `1000` documents be sufficient for me to create the same index on a stage environment and then do the maths of multiplying the size by `500,000`? – Naman Feb 07 '21 at 16:44
  • @Naman 1,000 indexed values is far too small to be useful to extrapolate a ratio for any larger populations. Depending on the indexed values & distribution (which affects prefix compression), the file size of small indexes is likely to be dominated by block allocations (eg 16k) which are not significant for reasonable sized indexes. I'd be sampling at larger values like 100k, 500k, 1m to extrapolate a trend, but there are other factors (like workload) that will affect the ratio over time. – Stennie Feb 12 '21 at 23:10
  • @Naman I suspect your underlying concern is how indexes will affect your working set, which we can follow on up via the related discussion you started on the MongoDB Community Forums: [Assess the right amount of cacheSize for wiredTiger upon index addition](https://developer.mongodb.com/community/forums/t/assess-the-right-amount-of-cachesize-for-wiredtiger-upon-index-addition/15805?u=stennie). – Stennie Feb 12 '21 at 23:11
  • @Stennie True. That is my current concern. More than happy to share further details there. Thank you for your responses here; I understand that sampling would have to be done on a significant number of documents and that sampling at a data size closer to production would be much more accurate. I have tried to use that correlation as an approximation to define the size for now. The discussion continues on the forum. – Naman Feb 13 '21 at 03:01

4 Answers


I just spoke with some of the 10gen engineers and there isn't a tool, but you can do a back-of-the-envelope calculation based on this formula:

2 * [ n * ( 18 bytes overhead + avg size of indexed field + 5 or so bytes of conversion fudge factor ) ]

Where n is the number of documents you have.

The overhead and conversion padding are Mongo-specific, but the 2× factor comes from the B-tree data structure being, in the worst case, roughly half full while still having allocated 100% of the space a full tree would require.
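As a rough illustration (a minimal sketch: the 18-byte overhead, ~5-byte fudge factor, and 2× worst-case factor come from the formula above, while the function name and example numbers are hypothetical):

```python
def estimate_index_size_bytes(num_docs, avg_key_size_bytes,
                              overhead_bytes=18, fudge_bytes=5):
    """Back-of-the-envelope MMAP-era estimate:
    2 * [ n * (overhead + avg indexed field size + fudge) ],
    where the 2x covers a worst-case half-full B-tree."""
    return 2 * num_docs * (overhead_bytes + avg_key_size_bytes + fudge_bytes)

# 10 million documents with an average indexed field of 8 bytes:
# 2 * 10,000,000 * (18 + 8 + 5) = 620,000,000 bytes (~591 MB)
print(estimate_index_size_bytes(10_000_000, 8))
```

As the comment thread notes, in practice this estimate can come out roughly double the size the shell actually reports, so treat it as an upper bound rather than a prediction.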

I'd explain more but I'm learning about it myself at the moment. This presentation will have more details: http://www.10gen.com/presentations/mongosp-2011/mongodb-internals

Tyler Brock
  • 3
    He can create an online calculator then :-) – Sergio Tulentsev Dec 22 '11 at 23:19
  • Sorry, need to open this question up again. By calculating the average field size from a representative number of documents, and plugging it into the equation listed, I get index sizes roughly double the actual value. The theory makes sense to me here, but in practice, based on what the mongo shell is reporting anyway, this isn't correct. – jpredham Sep 04 '12 at 14:34
  • How many documents, is it a large enough sample? Please provide an example. Actual size can obviously vary based on many different factors. – Tyler Brock Sep 05 '12 at 17:03
  • It just occurred to me that we probably allocate space for the maximum indexed field size in the 4k index bucket even though you are using about half of it in practice so the actual index size is roughly double. – Tyler Brock Nov 14 '12 at 04:30
  • Hey @TylerBrock, could you please tell me what `avg size of indexed field` means? If my doc looks like this `{ _id : 1, favoriteFood : "cheese" }` and I indexed on `favoriteFood`, would the "avg size of indexed field" be 12 since it has 12 characters? – Kevin Meredith Oct 02 '13 at 20:49
  • Hey @Kevin. In your example the avg indexed field size would be closer to 6 bytes as the field would be "cheese" and the index entry would look like "cheese" -> . – Tyler Brock Oct 02 '13 at 22:05

Another way to estimate is to ingest ~1000 or so documents into every collection (in other words, build a small-scale model of what you're going to end up with in production), create the indexes and so on, and extrapolate the final numbers from the db.collection.stats() averages.

Edit (from a comment):

Tyler's answer describes the original MMAP storage engine circa MongoDB 2.0, but this formula definitely isn't applicable to modern versions of MongoDB. WiredTiger, the default storage engine in MongoDB 3.2+, uses index prefix compression so index sizes will vary based on the distribution of key values. There are also a variety of index types and options which might affect sizing. The best approach for a reasonable estimate would be using empirical estimation with representative test data for your projected growth.
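In that spirit (a sketch only: the sample sizes and measured index sizes below are made-up illustrative numbers, not real data), you could fit a trend through a few measured sample points, e.g. at 100k/500k/1m documents, and extrapolate to the projected collection size:

```python
def extrapolate_index_size(samples, target_n):
    """Least-squares line through (num_docs, index_size) sample points,
    evaluated at target_n documents."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope * target_n + intercept

# Hypothetical measurements: totalIndexSize in MB at 100k, 500k, 1m docs
samples = [(100_000, 4.2), (500_000, 19.5), (1_000_000, 38.1)]
# Projected index size (MB) at 500 million documents
print(extrapolate_index_size(samples, 500_000_000))
```

Since prefix compression depends on the key distribution, real growth won't be perfectly linear; treat the extrapolation as an order-of-magnitude estimate and re-sample as the data grows.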

Ostati

You can check the sizes of the indexes on a collection with the command:

db.collection.stats()

More details here: http://docs.mongodb.org/manual/reference/method/db.collection.stats/#db.collection.stats
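For example (a sketch: the numbers in the stats document below are hypothetical, though `totalIndexSize` and the per-index `indexSizes` map are real fields in the stats output), the per-index sizes can be read out like this:

```python
# Hypothetical db.collection.stats() result (sizes in bytes)
stats = {
    "ns": "mydb.users",
    "count": 1_000_000,
    "totalIndexSize": 60_817_408,
    "indexSizes": {"_id_": 23_592_960, "email_1": 37_224_448},
}

for name, size in stats["indexSizes"].items():
    print(f"{name}: {size / 1024 / 1024:.1f} MB")
```

You can also pass a scale argument in the shell, e.g. db.collection.stats(1024*1024), to have the sizes reported in MB instead of bytes.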

Minh Nguyen

The best option is to test in a non-prod deployment!

Insert 1000 documents and check the index sizes, insert 100000 documents and check again, and so on.

An easy way to loop over all collections and total up the index sizes:

  var y = 0;
  db.adminCommand("listDatabases").databases.forEach(function (d) {
    var mdb = db.getSiblingDB(d.name);
    mdb.getCollectionNames().forEach(function (c) {
      var s = mdb[c].stats(1024 * 1024).totalIndexSize;
      y = y + s;
      print("db.Collection:" + d.name + "." + c + " totalIndexSize: " + s + " MB");
    });
  });
  print("============================");
  print("Instance totalIndexSize: " + y + " MB");
R2D2
  • Surely a brute-force way of doing it, but not scalable when I am speculating about adding one or more indexes to an existing database with millions of documents. By the way, I could just run stats() to extract the correct information once I set up those same documents on staging. – Naman Feb 05 '21 at 02:06
  • Also, if there are very intensive updates and deletes, the allocated space for documents and indexes can vary a lot... – R2D2 Feb 07 '21 at 14:53