I created a collection in MongoDB consisting of 11,446,615 documents.
Each document has the following form:
{
    "_id" : ObjectId("4e03dec7c3c365f574820835"),
    "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
    "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
    "howMany" : 3
}
httpReferer: just a URL
words: words parsed from the URL above. The size of the list is between 15 and 90.
I am planning to use this database to obtain a list of webpages with similar content.
I'll be querying this collection using the words field, so I created (or rather started creating) an index on this field:
db.my_coll.ensureIndex({words: 1})
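With that index in place, a similarity lookup might look like this (just a sketch; I'm assuming here that "similar" means sharing a given set of words, and the word list is made up):

// find pages whose word list contains all of the given words;
// $all can use the multikey index on the words field
db.my_coll.find({words: {$all: ["SEX", "DRUGS"]}})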
Creating this collection takes a very long time. I tried two approaches (the tests below were done on my laptop):
- Inserting and then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of the data; indexing took 30 hours (see the variant sketched after this list).
- Indexing before inserting: it would take a few days to insert all the data into the collection.
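For the first approach, one variant I could try (a sketch; I'm assuming background index builds are available, which the docs say they are since MongoDB 1.4) is to insert everything first and then build the index in the background, so the database stays responsive while it builds:

// build the index without blocking other operations;
// a background build is slower overall but doesn't lock the database
db.my_coll.ensureIndex({words: 1}, {background: true})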
My main focus is to decrease the time needed to generate the collection. I don't need replication (at least for now), and querying doesn't have to be lightning-fast.
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?