Replicated Sharding in MongoDB

Question

Hey, I have a MongoDB setup with 3 shards, each with 3 replicas, running on 3 physical servers. The sharding is range-based on a category id, so the data is spread evenly across the shards.

The amount of data I get into the database each week is huge, and I mostly query only data for the current day or the 2 previous days.

So I was trying to add a shard with no replicas to the current setup, so that the new shard would hold the old data (older than 5 days) and the 3 existing shards would hold only the last 5 days of data.

If this is possible, most of the queries will hit the 3 not-so-big shards, only rare queries will hit the new server for older data, and there should be some gain in TPS.

Is this in any way possible to configure in MongoDB with sharding or replication?

Thanks in advance.

Hmm, archival sharding like this isn't super reliable and is definitely not a standard part of MongoDB; however, tag-based sharding could do this — Sammaye, Sep 03 '13 at 07:27
Hey, thanks for the comment. It seems I missed tag-based sharding. It looks like it solves the problem I have. — Jiby Jose, Sep 03 '13 at 07:49
@Sammaye this answers my question. Please add your comment as an answer and I will mark it as the answer — Jiby Jose, Sep 03 '13 at 07:51
@Sammaye I went through the shard tags and it seems they can define constant ranges, but would it be possible to base them on relative dates, like last 2 days to last 5 days and so on? — Jiby Jose, Sep 03 '13 at 08:02

score 1 · Answer 1 · answered Sep 04 '13 at 04:06

While it might be tempting to use tag-aware sharding for this, it's actually not simple, nor is it very efficient. Here is why:

1) Your range of keys which should exist on the "old" shard is changing every day. If your cut-off is five days ago, at midnight you will need to update the tags to reflect that it's a new day.

2) As soon as you add the day that was five days ago to the range that belongs on the "old" shard, the balancer process will need to migrate that data to the old shard. That shard will hold loads of old data and so probably has really huge indexes, which makes writing to it much slower. Meanwhile, reading and removing the day-5 data from your "active" shard(s) may interfere with the queries on "current" data.
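
To make point 1 concrete, here is a minimal Python sketch of the boundary calculation a nightly job would have to repeat. It assumes a shard key whose leading field is a date, which is hypothetical here since the question's key is a category id. The names `zone_ranges`, `archive` and `recent` are made up for illustration, and `None` stands in for the open end of a range (MinKey/MaxKey in MongoDB).

```python
from datetime import date, datetime, time, timedelta


def zone_ranges(today, keep_days=5):
    """Return the date ranges each (hypothetical) zone should hold today.

    Ranges are (lower, upper) pairs, lower inclusive and upper exclusive,
    which is how MongoDB tag ranges behave. None marks an open end.
    """
    # Midnight at the start of the day `keep_days` days before `today`.
    cutoff = datetime.combine(today - timedelta(days=keep_days), time.min)
    return {
        "archive": (None, cutoff),  # everything older than the cut-off
        "recent": (cutoff, None),   # the last `keep_days` days
    }


# The boundary moves by one day every midnight, so the tag ranges have to be
# rewritten and one full day of data has to migrate to the archive shard.
monday = zone_ranges(date(2013, 9, 2))
tuesday = zone_ranges(date(2013, 9, 3))
moved = tuesday["archive"][1] - monday["archive"][1]
print(moved)  # the span of data that changes zone overnight
```

The calculation itself is trivial; the cost is in what follows from it, namely rewriting the tag ranges and letting the balancer move that day of data.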

So, maybe it's not such a great option - although it is a valid option to consider.

I would suggest considering something else: maybe insert the data into this cluster and also into another "archival" replica set, and then use a TTL (time to live) index to "expire" data from this cluster once it gets to be older than, say, a week. Just something to consider if you don't actually need to query on older data very often.
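
A rough sketch of that idea in Python follows. It assumes the two arguments are pymongo `Collection` objects, one reached through the sharded cluster and one in the archival replica set. The helper names and the `created_at` field are hypothetical. `create_index(..., expireAfterSeconds=...)` and `insert_one` are the pymongo calls this relies on.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def ensure_ttl_index(collection, field="created_at", days=7):
    """Create a TTL index so documents expire `days` after `field`.

    `collection` is expected to be a pymongo Collection; `field` must hold
    a BSON date (a Python datetime), otherwise documents never expire.
    """
    return collection.create_index(field, expireAfterSeconds=days * SECONDS_PER_DAY)


def dual_write(active, archive, doc, field="created_at"):
    """Insert one document into both the archive and the active cluster."""
    doc = dict(doc)  # copy so the caller's dict is not modified
    doc.setdefault(field, datetime.now(timezone.utc))
    archive.insert_one(doc)  # pymongo adds an _id to doc here if it has none
    active.insert_one(doc)   # so both copies share the same _id
    return doc
```

The TTL index would go only on the active cluster's collection, so the archive keeps everything. Nothing in this sketch routes reads: the application has to decide for itself which of the two to query.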

Another option is to leave things the way they are. If your data is well balanced, you are already handling more TPS than you would if you were querying against "old" data. Remember, only data actually being used is loaded into physical RAM; if you aren't reading some old data, it'll just quietly sit there on disk. Just make sure that all your queries are using indexes efficiently: a collection scan can negate what I described in an instant!
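
One way to check for collection scans is to inspect the output of `explain()`. The helper below is a small sketch that assumes the `queryPlanner`-style explain document used by MongoDB 3.0 and later, where a collection scan appears as a stage named `COLLSCAN`. Servers from the time of this answer reported a `BasicCursor` instead. The function name and the sample documents are made up for illustration.

```python
def has_collection_scan(plan):
    """Return True if any chosen stage in an explain() result is a COLLSCAN."""
    if isinstance(plan, dict):
        if plan.get("stage") == "COLLSCAN":
            return True
        # Ignore plans the optimizer considered but did not choose.
        return any(
            has_collection_scan(value)
            for key, value in plan.items()
            if key != "rejectedPlans"
        )
    if isinstance(plan, list):
        return any(has_collection_scan(item) for item in plan)
    return False


# Hand-written examples in the shape of the "queryPlanner" explain output.
indexed = {"queryPlanner": {"winningPlan": {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "category_id_1"},
}}}
unindexed = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
```

With pymongo, the argument would be the dictionary returned by calling `explain()` on a cursor.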

So we can't give a dynamic date for a tag range, so that the date is evaluated against the current time each time balancing is done? — Jiby Jose, Sep 04 '13 at 04:11
Hey, about the archival replica set: do I need to handle it manually, inserting into the db farm as well as the archival replica set and routing queries between them myself? Or does MongoDB provide such a feature? — Jiby Jose, Sep 04 '13 at 04:16
There is no way to give anything other than static values for tags. And if you really need to query old data frequently I wouldn't recommend using the option of putting it in a separate replica set... — Asya Kamsky, Sep 04 '13 at 07:23
