
I have a collection that only holds 10 million documents, totaling 10 gigabytes. This may not seem like enough to necessitate sharding.

But there is a query that takes 1000 seconds to complete on this collection.

If I divide this collection into 1000 shards, then I can take advantage of the divide-and-conquer strategy and reduce the query time to 1 second (in theory, excluding overhead and other complications).

Is the above scenario not the primary reason for sharding? If so, it seems odd that MongoDB Atlas only allows 50 shards maximum.

  • _"excluding overhead and other complications"_ - presumably these shards are running on a spherical server in a vacuum. – jonrsharpe Jul 25 '23 at 15:00
  • Do you have access to 1000+ nodes? I don't think so, so why do you ask such a question? Do you expect to get the data in less than one second just by adding shards? – Wernfried Domscheit Jul 26 '23 at 07:29
  • As usual, it is very difficult to answer your questions, because you provide neither sample input data, nor a query, nor the expected result. – Wernfried Domscheit Jul 26 '23 at 07:30

1 Answer

Parallelism doesn't quite work like that.

Check out Amdahl's law: the speedup from adding workers is capped by the part of the job that cannot be parallelized.
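As a rough illustration of Amdahl's law applied to the question, here is a minimal sketch. The 99% parallel fraction is an assumption picked for the example, not a measurement of any real query, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumption for illustration: 99% of the 1000-second query parallelizes perfectly.
    baseline_seconds = 1000.0
    for shards in (1, 10, 50, 1000):
        speedup = amdahl_speedup(0.99, shards)
        print(f"{shards:>5} shards: speedup {speedup:6.1f}x, "
              f"query time {baseline_seconds / speedup:7.1f} s")
```

Under that assumption, 1000 shards give a speedup of roughly 91x (about 11 seconds), not 1000x, and this is still before counting any coordination cost.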

When you have a single replica set, your read query can be handled by a single node, so it just does the work.

As soon as you add a second replica set, you also need to add a query router to determine which replica set(s) need to be sent the query, send out the query, collect the responses from each replica set, waiting for the slowest one to respond, and then combine the partial result sets and send a reply to the requestor.

So when handling a distributed query, there is inherently more work to be done, more communication between nodes, and you have to wait on the slowest participant in order to complete the query.

Build out more than a few dozen shards, and you rather quickly start losing performance because of all the coordination required.
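To see why adding shards eventually hurts, here is a toy model. It assumes the scan work divides evenly across shards and that the router pays a fixed coordination cost per shard. Both the linear cost model and the constants are made up for illustration; only the shape of the curve matters, and real MongoDB behavior will differ.

```python
def query_time(shards: int, work_seconds: float = 1000.0,
               coordination_seconds_per_shard: float = 1.0) -> float:
    """Toy model: evenly divided work plus a per-shard coordination cost."""
    return work_seconds / shards + coordination_seconds_per_shard * shards


def best_shard_count(max_shards: int = 1000) -> int:
    """Shard count that minimizes the modeled query time."""
    return min(range(1, max_shards + 1), key=query_time)


if __name__ == "__main__":
    for shards in (1, 10, best_shard_count(), 100, 1000):
        print(f"{shards:>5} shards: {query_time(shards):8.2f} s")
```

With these made-up constants the modeled time bottoms out at a few dozen shards and then climbs again, so that 1000 shards end up no faster than one.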

MongoDB recommends a soft limit of around 1 TiB per replica set.

Joe
  • How do you explain Facebook's architecture then? `Shard Manager manages tens of millions of shards hosted on hundreds of thousands of servers across hundreds of applications in production.` https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/ How is Facebook able to scale to millions of shards? – Bear Bile Farming is Torture Jul 26 '23 at 18:05
  • What Facebook calls a shard is different from what MongoDB calls a shard. In MongoDB terminology, a shard is a replica set (consisting of several servers) that stores the same portion of the overall data. It wouldn't make much sense to say that Facebook runs tens of millions of replica sets on hundreds of thousands of servers. What Facebook is calling a "shard" is probably close to what MongoDB calls a "chunk" or "range". – Joe Jul 26 '23 at 23:15