MongoDB capacity planning

Question

I have an Oracle database with around 7 million records per day (~300 GB), and I want to switch to MongoDB.

To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a 3-node replica set, will be enough, but I'd like to hear your thoughts :)

I'd like to have an HA setup :)

Thanks in advance!

score 0 · Answer 1 · answered Jul 17 '12 at 09:58

For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to. This is the active data that needs to stay in RAM for high performance.
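As a rough illustration of that estimate, here is a minimal Python sketch. Only the 7 million records/day figure comes from the question; the average document size, the number of days that stay "hot", and the index overhead factor are placeholder assumptions that you would replace with measurements from your own data.

```python
def working_set_gb(records_per_day, avg_doc_bytes, active_days, index_overhead=0.2):
    """Rough working-set estimate in GB (1 GB = 10**9 bytes).

    avg_doc_bytes, active_days and index_overhead are assumptions,
    not values taken from the question.
    """
    data_bytes = records_per_day * avg_doc_bytes * active_days
    return data_bytes * (1 + index_overhead) / 1e9


# 7 million records/day is from the question; 1 KB documents and a
# 7-day active window are made-up placeholders.
estimate = working_set_gb(7_000_000, 1_000, 7)
print(f"Estimated working set: {estimate:.1f} GB")
```

If the result is well above the RAM you can give a single server, that is one signal that sharding may be worth testing.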

Also, be very sure WHY you are migrating to Mongo. I'm guessing that in your case it is scalability, but know your data well before doing so.
For your POC, keeping two shards means roughly 150 GB on each. If you have that much disk available, no problem.
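The disk arithmetic changes once replication is involved, because every data-bearing member of a shard's replica set keeps a full copy of that shard's data. A small sketch of the calculation, assuming a perfectly even split across shards (which a real shard key may not give you); `disk_plan` is a hypothetical helper, not a MongoDB API:

```python
def disk_plan(total_gb, shards, data_members_per_shard):
    """Disk needed per node, per shard, and for the whole cluster.

    Assumes data is spread perfectly evenly across shards and that
    arbiters store no data.
    """
    per_node = total_gb / shards
    per_shard = per_node * data_members_per_shard
    return {
        "per_node_gb": per_node,
        "per_shard_gb": per_shard,
        "cluster_gb": per_shard * shards,
    }


# 300 GB and 2 shards are from the question; try 1, 2 or 3 data-bearing members.
for members in (1, 2, 3):
    print(members, disk_plan(300, 2, members))
```

With two or three data-bearing members per shard, the per-shard total comes out at 300 to 450 GB rather than 150 GB.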

Aafreen Sheikh

Actually, each MongoDB [shard](http://www.mongodb.org/display/DOCS/Sharding+Introduction) is a replica set, so at a minimum there should be two replicas (with data) plus an arbiter. That would actually be 300-450 GB per shard, since the original question referenced HA. Secondly, even distribution of data relies on [choosing a good shard key](http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key), so you can't assume that the distribution will be balanced. – Stennie Jul 17 '12 at 14:10
You are right. My assumption was that the shard key should be chosen so that data gets evenly distributed across all shards and there are no hotspots. But I don't think each shard needs the capacity to store the whole database. Or does it? As far as the POC is concerned, replication is not required. But for a production database, replication is a must, and a minimum of three nodes per shard is recommended (an odd number, to avoid a tie in the election of the primary node). – Aafreen Sheikh Jul 17 '12 at 14:49
Each shard does not need the capacity to store the whole database; however, if the goal is HA then each shard should be configured as a proper replica set. With a replica set you need at least one (and ideally two) secondaries which have a full copy of the data for *that shard*. If replication isn't required for the POC, then sharding should not be either; otherwise you are not testing the intended environment with redundancy and failover. – Stennie Jul 17 '12 at 21:34
Hmmm, I guess it depends on what the POC is targeted at. If it is aimed at comparing MongoDB's data model with the one in Oracle, then you need not get into those complexities. But if the POC is intended to test Mongo's failover capabilities, then yes, replication should be tried out. – Aafreen Sheikh Jul 18 '12 at 08:53

score 0 · Answer 2 · answered Jul 17 '12 at 10:10

Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will affect the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few big, deep shards (maybe two or three) if your data can be easily segmented into halves or thirds, or several lighter, thinner shards if you can shard on a more diverse key.
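To make the "few deep shards versus many thin shards" point concrete, here is a toy Python model. It is not how MongoDB's balancer works; it only assumes that all documents sharing one shard key value end up on the same shard, and it hands out distinct key values round-robin. The `region` and `user` keys are invented for illustration.

```python
from collections import Counter


def shards_used(key_values, num_shards):
    """Toy placement: every distinct key value lives entirely on one shard.

    Distinct values are assigned to shards round-robin; this is a
    simplification, not MongoDB's chunk balancing.
    """
    placement = {}
    for value in sorted(set(key_values)):
        placement[value] = len(placement) % num_shards
    return Counter(placement[v] for v in key_values)


# Hypothetical low-cardinality key: only three distinct values.
regions = ["EU", "US", "APAC"] * 1000
# Hypothetical high-cardinality key: 3000 distinct values.
users = [f"user{i}" for i in range(3000)]

print("region key:", dict(shards_used(regions, 6)))
print("user key:  ", dict(shards_used(users, 6)))
```

In this model the three-value key can never populate more than three of the six shards, while the more diverse key spreads across all of them.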

score 0 · Answer 3 · answered Jul 17 '12 at 14:42

It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (e.g., will your application requirements outgrow the resources of a single machine? How much of your data set will be the active working set for queries?).

It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.

Some notes to get you started:

- MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
- Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
- Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
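The three-node minimum for replica sets follows from the election rule: a new primary needs votes from a strict majority of the voting members. A short sketch of that arithmetic (the function names are mine, not part of any MongoDB API):

```python
def majority(voting_members):
    """Votes required to elect a primary: a strict majority."""
    return voting_members // 2 + 1


def tolerated_failures(voting_members):
    """How many voting members can be down while an election can still succeed."""
    return voting_members - majority(voting_members)


for n in range(1, 6):
    print(f"{n} voting members: majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Two members tolerate no failures, three tolerate one, and a fourth adds no extra tolerance over three, which is why odd member counts are usually suggested.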

Other considerations include planning your documents based on your application usage. For example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.

If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.

Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
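A toy Python simulation of the write "hot spot" case, under stated assumptions: four shards, range boundaries fixed from older data, and new documents arriving with an ever-increasing key (such as a timestamp or sequence number). The split points and the MD5-based placement are invented for illustration and are not MongoDB's own mechanisms.

```python
import bisect
import hashlib
from collections import Counter

NUM_SHARDS = 4
# Hypothetical range boundaries derived from existing keys 0..999:
# shard 0 owns keys < 250, shard 1 owns 250..499, shard 2 owns 500..749,
# and shard 3 owns everything from 750 upward.
SPLIT_POINTS = [250, 500, 750]


def range_shard(key):
    """Range-based placement using the fixed split points above."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def hashed_shard(key):
    """Toy hash-based placement (MD5 modulo shard count)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# New writes with a monotonically increasing key.
new_keys = range(1000, 2000)
range_writes = Counter(range_shard(k) for k in new_keys)
hashed_writes = Counter(hashed_shard(k) for k in new_keys)

print("range-based writes per shard:", dict(range_writes))
print("hash-based writes per shard: ", dict(hashed_writes))
```

With the range-based placement every new write lands on the last shard, while the hash-based placement spreads the same writes across all four. The trade-off is that range queries on the key no longer target a single shard.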
