
On Google Cloud Compute Engine we have the option to choose between three 'Machine families':

  • General-purpose
  • Compute-optimised
  • Memory-optimised

What would be the best option to choose for a MongoDB instance?

Enzo
  • Define how many resources your MongoDB instance requires (CPU, memory, network, disk size and disk IOPS). Then you will know which instance to select. A tiny MongoDB instance will run just fine in a container. A large instance might require 16 vCPUs, 96 GB of memory and several TB of disk. In general, databases need memory, so start with the Memory-optimized instance types. MongoDB does almost everything in memory. – John Hanley Aug 07 '22 at 08:07
  • Requests for product, service, or learning material recommendations are off-topic because they attract low quality, opinionated and spam answers, and the answers become obsolete quickly. Instead, describe the business problem you are working on, the research you have done, and the steps taken so far to solve it – djdomi Aug 07 '22 at 17:43

1 Answer


In addition to @John Hanley's comment, I found a very helpful article, "MongoDB Sizing Guide":

Motivation

When building an application with MongoDB as its data platform, at some point the question arises: what MongoDB cluster size should we start with, and how does that size change as the business grows? How much RAM, storage and CPU is required? Should we use sharding to scale out? If so, how many shards do we need? This article explains how to estimate the size of your MongoDB cluster. I will make a couple of assumptions in order to get a basis for my calculations. One assumption is that all indexes should fit into cache. Obviously the world is not just black and white, and in many cases we need to adjust our assumptions. For my estimation I use this as a starting point, as it guarantees to a certain extent good query performance. You should always revise your estimation once you have more data and metrics available to adjust your cluster size.

Hardware aspects

There are a number of key aspects we need to consider when sizing a MongoDB cluster. First and foremost there is the cache. MongoDB's performance relies heavily on caching, and therefore memory is the most valuable hardware resource for MongoDB. Any MongoDB sizing exercise should start with a memory estimation. Secondly there is storage. Storage is a crucial component of a MongoDB deployment, as it is where we persist our data; fortunately it is relatively cheap compared to RAM. Thirdly there is the CPU, which is used for many aspects of MongoDB workloads, e.g. aggregating data in MongoDB can add considerable load to the CPU, so CPU is important too. And last but not least there is the network. MongoDB is a distributed database, so a lot of data needs to be transferred over the network. Network sizing and monitoring is beyond the scope of this article; generally speaking, high latency and low bandwidth can lead to poor MongoDB performance.

Sizing

In my experience the best way to do a MongoDB size estimation is with a reasonably large test data set. Quite often we don't have such a test data set at hand; in that case I usually start by building one. I use mgeneratejs or mgenerate4j to generate dummy data into a small MongoDB Atlas cluster. MongoDB Atlas comes in very handy, as I can quickly spin up a MongoDB cluster without having to provision any hardware or install any software. Usually I generate between 1% and 10% of the target data set size. I then create the indexes that are required to support my application queries. It is crucial that we have a good understanding of what queries our application will run and how we support those queries with indexes. From that generated dummy data set I extract the index size, storage size and data size and project the corresponding target sizes.
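As a rough sketch of that extraction step (Python with pymongo; the connection string and database name below are placeholders, not from the article):

    from pymongo import MongoClient

    # Placeholder connection string and database name; replace with your own test cluster.
    client = MongoClient("mongodb+srv://user:password@sizing-test.example.mongodb.net")
    db = client["sizing_test"]

    # dbStats reports data, storage and index sizes for the whole database (in bytes).
    stats = db.command("dbStats")

    gb = 1024 ** 3
    print(f"Dummy data size:    {stats['dataSize'] / gb:.2f} GB")
    print(f"Dummy storage size: {stats['storageSize'] / gb:.2f} GB")
    print(f"Dummy index size:   {stats['indexSize'] / gb:.2f} GB")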

Example: If I have a dummy data set that is 1% of the size of the target data set size:

Target index size (TIS) = dummy index size / 1%

Target storage size (TSC) = dummy storage size / 1%

Target data set size (TDS) = dummy data set size / 1%
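Projecting those numbers is simple arithmetic; here is a small sketch in plain Python (the dummy sizes are made-up example values):

    def project_target_sizes(dummy_index_gb, dummy_storage_gb, dummy_data_gb, sample_fraction):
        # Scale the sizes measured on the dummy data set up to the target data set.
        return {
            "target_index_size_gb": dummy_index_gb / sample_fraction,
            "target_storage_size_gb": dummy_storage_gb / sample_fraction,
            "target_data_size_gb": dummy_data_gb / sample_fraction,
        }

    # Dummy data set is 1% of the target data set size.
    print(project_target_sizes(1.8, 5.0, 1.0, sample_fraction=0.01))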

Next I calculate my working set. Here we need to make some assumptions about the most frequently used data, e.g. we keep one year's worth of data in MongoDB but only frequently look at data from the last 5 days, and even of those last 5 days probably only a subset of maybe 40%. Then I calculate my working set as follows:

Frequently used data in days (FDD) = 5

Total data set days (TDSD) = 365

Subset percentage (S%) = 40%

Target data set size (TDS) = 100 GB

Working set (WS) = FDD / TDSD * S% * TDS = 5 / 365 * 40% * 100 GB = 0.55 GB
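The same working-set estimate in code (plain Python, using the example numbers above):

    def working_set_gb(frequently_used_days, total_days, subset_fraction, target_data_gb):
        # WS = FDD / TDSD * S% * TDS
        return frequently_used_days / total_days * subset_fraction * target_data_gb

    # 5 days out of 365, 40% of that data, 100 GB target data set size
    print(round(working_set_gb(5, 365, 0.40, 100), 2))  # ~0.55 GB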

I then use those numbers to calculate memory, storage size and number of shards I need for my MongoDB cluster.

RAM

So let's start with the memory calculation. In its default configuration MongoDB reserves 50% of the physically available memory for the WiredTiger cache. More precisely, the WiredTiger cache is the available memory minus 1 GB, divided by 2. So if we have 64 GB of RAM, the WiredTiger cache is 31.5 GB. This behaviour can be changed, but in 99% of use cases the default is correct and shouldn't be altered.

The WiredTiger storage engine has a number of target thresholds it tries to maintain. One of those thresholds is to keep the cache at most 80% full. Once it reaches 80% it starts evicting blocks from the cache. This means that in effect only roughly 40% of the available memory is used for cached documents and indexes.

To achieve good query performance we should allow all the indexes plus the working set to remain in the WiredTiger cache. Or in other words the memory required for good performance is 250% of the sum of all indexes and the working set.
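The 250% multiplier follows directly from those two defaults; a short sketch in plain Python, assuming 64 GB of RAM as in the example above:

    ram_gb = 64

    # Default WiredTiger cache: (RAM - 1 GB) / 2
    wiredtiger_cache_gb = (ram_gb - 1) / 2        # 31.5 GB for 64 GB of RAM

    # Eviction starts at ~80% cache fill, so only that part holds hot documents and indexes.
    usable_cache_gb = wiredtiger_cache_gb * 0.80  # ~25.2 GB, i.e. roughly 40% of RAM

    # Inverting that ~40% gives the ~250% multiplier used in the RAM formula below.
    print(usable_cache_gb, round(ram_gb / usable_cache_gb, 2))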

How does this apply to sharding? Simply put, if we require 128 GB of RAM we could either run a single replica set with each host/replica having 128 GB of RAM, or we could run a sharded cluster with two shards, each shard having 64 GB of RAM on each replica-set host. This raises a new question: what is the ideal shard size and shard count? Again, this depends on a number of factors, cost being one of them. For example, the MongoDB Enterprise licence caps at 256 GB of RAM: each MongoDB instance with up to 256 GB of RAM requires 1 licence, while an instance with 300 GB of RAM requires 2 licences. So the sweet spot from a licensing perspective is a multiple of 256 GB of RAM per shard.

But licence cost is not the only aspect when it comes to shard sizing. Parallelism is an important consideration as well. With more shards we can parallelise disk IO and CPU usage and therefore scale out disk IO and CPU capacity. This also has an effect on backup recovery times, as with more shards we can restore from backup faster. On the other hand, the more shards we have, the higher the complexity of managing the cluster becomes: there is more coordination traffic between the components and a higher probability of failing components.

To conclude, the right number of shards depends on many factors, not only memory. We need to consider shard sizing on a case by case basis. For simplification, the sizing sheet provided towards the end of this article optimises for licence cost.

RAM formula:

The simple formula I use to calculate the required RAM for MongoDB is the following:

Total index size (TIS)

Working set size (WS)

Total cluster memory (TCM) = (TIS + WS) * 250% + 1 GB

Example: If we have a total index size of 180 GB and a working set of 20 GB:

TCM = (180 GB + 20 GB) * 250% + 1 GB = 501 GB

These 501 GB of RAM can now be distributed over multiple shards, ideally of equal size. To optimise for licence cost we would deploy 2 shards, each having ~250 GB of RAM.
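A sketch of the RAM formula and the licence-driven shard split (plain Python; the 256 GB per-licence cap is the one mentioned above):

    import math

    def total_cluster_memory_gb(total_index_gb, working_set_gb):
        # TCM = (TIS + WS) * 250% + 1 GB
        return (total_index_gb + working_set_gb) * 2.5 + 1

    def shards_for_memory(tcm_gb, max_ram_per_shard_gb=256):
        # Split the required RAM over shards so each shard stays within one licence.
        return math.ceil(tcm_gb / max_ram_per_shard_gb)

    tcm = total_cluster_memory_gb(180, 20)   # 501 GB
    shards = shards_for_memory(tcm)          # 2 shards
    print(tcm, shards, tcm / shards)         # 501.0 2 250.5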

Storage-size

Storage size estimation is quite straightforward. Use your test data set as a baseline and look at the storage size reported by db.stats(). The formula I use here is the following:

Test data storage size (TDSS)

Test data document count (TSDC)

Target data document count (TADC)

Buffer percentage (B%) = usually 70%

Total storage size (TSS) = TDSS / TSDC * TADC / B%

Example: If I have 10 million documents in my test data set, the total storage size of that data set is 10 GB, and the target system should store 1 billion documents, then it will require ~1.4 TB of disk space.

TSS = 10 GB / 10'000'000 * 1'000'000'000 / 70% = ~1.4 TB

Note: The storage can be distributed over multiple shards. As a baseline I use a maximum of 1 TB per shard, so in this example I would deploy at least 2 shards, obviously also depending on how much total cluster memory (TCM) is required.

Why 1 TB max? Well, we could go higher here. MongoDB Atlas, for example, allows up to 4 TB per shard, but 1 TB is a bit easier to manage, especially when it comes to backup recovery and initial syncs.
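The storage projection and the 1 TB-per-shard baseline can be sketched the same way (plain Python, using the example numbers above):

    import math

    def total_storage_size_gb(test_storage_gb, test_doc_count, target_doc_count, buffer_fraction=0.70):
        # TSS = TDSS / TSDC * TADC / B%
        return test_storage_gb / test_doc_count * target_doc_count / buffer_fraction

    def shards_for_storage(tss_gb, max_storage_per_shard_gb=1024):
        # Baseline of at most ~1 TB of storage per shard.
        return math.ceil(tss_gb / max_storage_per_shard_gb)

    tss = total_storage_size_gb(10, 10_000_000, 1_000_000_000)  # ~1429 GB, i.e. ~1.4 TB
    print(round(tss), shards_for_storage(tss))                  # 1429 2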

Storage IOPS

If we use SSDs we usually don't run into an IOPS bottleneck, as long as our queries are covered by indexes and we have enough memory and CPU. Cheaper SSDs start at around 3,500 IOPS. Still, we should watch out for consumed IOPS when running performance and stress tests. If, for example, read IOPS are constantly high and near the maximum supported IOPS of your disks, it might be that unnecessary data is being read from disk due to collection scans or insufficient memory.

CPU

Unfortunately CPU is a bit more difficult to estimate. I usually start with a reasonable baseline of

Total Cluster Cores (TCC) = TCM / 4

Note: I didn't pull that formula out of a hat; it is the formula MongoDB Atlas uses for its standard cluster tiers. What is important is that we have a reasonable baseline. The next step is to run performance and stress tests and make sure the average CPU usage doesn't go too high. I suggest not going above 33% average CPU usage.
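And the CPU baseline as a small sketch (plain Python, using the 501 GB TCM from the RAM example):

    import math

    def total_cluster_cores(total_cluster_memory_gb):
        # TCC = TCM / 4, the RAM-to-vCPU ratio the article attributes to the standard Atlas tiers
        return math.ceil(total_cluster_memory_gb / 4)

    print(total_cluster_cores(501))  # 126 cores across the whole cluster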

Conclusion

MongoDB size estimation isn't that complicated if we know how MongoDB caching, indexing and sharding work and how to generate test data. I have created a sizing tool in the form of a Google Sheet which can be used as a starting point for MongoDB sizing.

It is worth mentioning that this article describes how to "estimate" the "initial" size of a MongoDB cluster. Every use case is slightly different and can result in different load patterns and resource requirements. The only accurate way to size a MongoDB cluster correctly is to repeatedly test, observe and refine.

Jeffrey D.