
GFS research paper snapshot

It says (this is my interpretation after reading the paper and its reviews): "inter-rack bandwidth is lower than aggregate intra-rack bandwidth (I'm not sure what it means by 'aggregate'; the kind of comparison isn't clear to me). Thus, by placing data across various racks, clients can exploit the aggregate bandwidth of reads from various racks." Like how? That's my question. If you place data in various racks, how can you exploit the intra-rack aggregate bandwidth?

"In the case of mutations, where the client has to send data, multiple racks are a disadvantage because the data has to travel longer distances."

I don't get the point it's making about bandwidth. Can anyone explain? Why would it be different for reads and writes? I understand writes: if you write at distance = 0 and then have to write at distance = 1000, your data needs to travel a longer distance. But why is multi-rack placement beneficial for reads?

Some background information-:

A rack is a collection of chunkservers (30-40).

A chunkserver is a collection of 64 MB chunks.

A chunk is a collection of 64 KB blocks.

Here's the GFS architecture-: [GFS Architecture diagram from the paper]

Reference-:

  1. https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
  2. https://www.cnblogs.com/fxjwind/archive/2012/07/17/2595494.html
  3. https://csjourney.com/google-file-system-paper-explained-summary-part-3/

Other sources-:

What's written in some solution manuals I saw online-:

To put it as simply as possible: you have multiple copies of each chunk, so you can read any one of them from anywhere, but you need to write to all of them, everywhere.

But there can be scenarios where a read also has to travel far and spend a lot of bandwidth, since the data might not be nearby. Plus, these systems have some tunable consistency: you can't always just read from one place and send the result to the client; you may need to read from multiple places.

Another blog gives this example, but I wasn't absolutely clear on it, even though I'm well versed in undergraduate networking courses-:

Let's say you have 10 chunkservers in a rack, all with NVMe drives delivering up to 3,200 MB/s. The aggregate (reading from all chunkservers in the rack at the same time) would be 32,000 MB/s. Now, if the inter-rack network is SFP+, it can only deliver 10 Gbps (about 1,250 MB/s), which is far less than the aggregate bandwidth.

That's under ideal conditions on a single rack. Let's say the cluster has 10 racks and the entire network is SFP+. The client can still only consume at 10 Gbps, but by distributing the reads among all the racks, that becomes an average of 1 Gbps per rack. Furthermore, given that the topology may be uneven and some racks may have more latency than others for this client, the client can choose the lowest-latency ("nearest" in the paper) rack to do most of its reading from.
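The arithmetic in that example can be checked with a quick sketch (the figures are the blog's hypothetical numbers, not from the GFS paper):

```python
# Hypothetical figures from the blog example above.
NVME_MBPS = 3_200          # per-drive sequential read, MB/s
SERVERS_PER_RACK = 10
SFP_PLUS_GBPS = 10         # inter-rack link speed, Gbps

# Aggregate intra-rack read bandwidth: every server streaming at once.
intra_rack_aggregate_mbps = NVME_MBPS * SERVERS_PER_RACK   # 32,000 MB/s

# A 10 Gbps link carries at most 10,000 / 8 = 1,250 MB/s of payload
# (ignoring protocol overhead).
inter_rack_mbps = SFP_PLUS_GBPS * 1_000 / 8                # 1,250 MB/s

# One client reading over the fabric is capped by its own 10 Gbps link,
# so spreading those reads over 10 racks costs each rack only ~1 Gbps.
RACKS = 10
per_rack_load_gbps = SFP_PLUS_GBPS / RACKS                 # 1 Gbps per rack

print(intra_rack_aggregate_mbps, inter_rack_mbps, per_rack_load_gbps)
```

The point of the comparison: one rack's drives can source 32,000 MB/s internally, but any single inter-rack link only moves about 1,250 MB/s, so cross-rack reads have to be spread thin across many racks.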

Another blog writes this-:

More copies of data increase the maximum possible read bandwidth, but more copies don't increase write bandwidth.

What does "bandwidth" mean here in GFS? How is it defined? I'm thinking bandwidth is the amount of data that can be transferred through a piece of networking equipment per unit time. It looks like the blog is saying the same "read from anywhere, write everywhere" thing, but the way the term "bandwidth" is used confuses me.

Another blog post writes this-:

Typically, the servers in a single rack will be connected by a top of rack switch which connects to every server in that rack. The servers in the rack will be able to communicate with one another at the link speed of their interface, and all of them can do this at the same time. The top of rack switch will connect further to a core switch, using high-bandwidth connections. The core switch is connected to every other top of rack switch. But usually, the link speed of the connection to the core switch will be smaller than the sum of the link speeds of the connections to every server in the rack.

The result of this is that the bandwidth available to servers within the same rack is higher than the bandwidth to communicate to servers outside that rack. (This isn't always true. Facebook builds networking so that the inter-rack bandwidth is the same as the intra-rack bandwidth. That gives flexibility at the cost of power efficiency.)

It does bring to mind the 3-tier design concept of core, access, and distribution layers, where the core switch has the best possible speed, but the aggregated distribution/access switches can still have more total bandwidth than the core switch's speed. So what? I still don't get it.

How do reads exploit the aggregate bandwidth of multiple racks (as the research paper claims) when we've placed the data in chunks spread across multiple racks? It doesn't make much sense to me and is confusing.

1 Answer


The aggregation is many nodes, talking to many other remote nodes, over a relatively small number of inter-rack fabric links. In general in computing, high bandwidth is easier over shorter distances and fewer hops.

Modern Ethernet switches are of course very fast, but the fundamental problem still exists, and it isn't limited to the networking concept of core, access, and distribution tiers. Let's upgrade an example to 2022 speeds: say 48 × 25 Gb servers plus a handful of 100 Gb uplinks, in a pizza-box switch. That roughly terabit total between the hosts directly connected to the switch still exceeds the bandwidth available for inter-rack traffic. This power-saving oversubscription pays off because most designs have some physical locality: database servers might be in the same rack as some of the application servers, giving low latency and cheap bandwidth.
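The oversubscription in that 2022 example can be made concrete. Note the uplink count of 4 is an assumption on my part; the answer only says "a handful":

```python
# Hypothetical top-of-rack switch from the example above:
# 48 server-facing 25 Gb ports, plus an assumed 4 x 100 Gb uplinks.
DOWNLINKS = 48
DOWNLINK_GBPS = 25
UPLINKS = 4            # assumption: "a handful" of uplinks
UPLINK_GBPS = 100

server_facing_gbps = DOWNLINKS * DOWNLINK_GBPS   # 1,200 Gbps inside the rack
uplink_gbps = UPLINKS * UPLINK_GBPS              # 400 Gbps out of the rack

# Oversubscription ratio: how much the servers' combined demand can
# exceed what the uplinks carry out of the rack.
ratio = server_facing_gbps / uplink_gbps         # 3.0, i.e. 3:1

print(server_facing_gbps, uplink_gbps, ratio)
```

A 3:1 ratio means that if every server tried to talk full-rate to hosts outside the rack at once, only a third of that traffic would fit on the uplinks, while the same servers talking to each other through the top-of-rack switch face no such bottleneck.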

Distributed scale-out applications can take advantage of the total bandwidth between nodes, as they are peer to peer applications. Imagine a RAID array, except instead of storage devices, the data is copied several times and striped across many compute nodes.

Improving the overall scale and reliability of a distributed system eventually means exceeding one rack of gear. Sometimes Bad Things happen: maybe both power supplies are lost and a rack goes down. To survive this, ideally a distributed storage system would have copies of data chunks spread across multiple racks. Yet that puts strain on what is typically limited inter-rack bandwidth.

An infrastructure-aware distributed application can account for data locality and durability requirements: reading from multiple local nodes to take advantage of cheap bandwidth, while writing to nodes in other racks to ensure the data survives a rack failure. Compare this to how a storage array can read from all of its member disks or caches at once, but for durability eventually needs to commit writes to slow disks.
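That read/write asymmetry can be sketched as a toy replica-selection policy. The rack names and the nearest-first rule here are illustrative only; GFS's actual placement and replica-choice logic is more involved:

```python
# Toy model: each chunk replica lives on a (rack, chunkserver) pair,
# with copies deliberately spread across racks for durability.
replicas = [("rack-A", "cs-1"), ("rack-B", "cs-7"), ("rack-C", "cs-3")]

def pick_read_replica(replicas, client_rack):
    """Reads: prefer a replica in the client's own rack, where
    bandwidth is cheap; otherwise fall back to any replica."""
    for rack, server in replicas:
        if rack == client_rack:
            return (rack, server)
    return replicas[0]

def write_targets(replicas):
    """Writes (mutations): every replica must be updated, including
    those in other racks, so the data crosses the inter-rack links."""
    return list(replicas)

print(pick_read_replica(replicas, client_rack="rack-A"))  # the local copy
print(len(write_targets(replicas)))                       # all 3 copies
```

This is exactly the asymmetry the question asks about: a read needs only one copy, so each client can be served from whichever replica is cheapest to reach, while a write must touch every copy no matter where it lives.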

Or you design so that inter-rack bandwidth is the same as intra-rack. More expensive in hardware, power, and cabling, but less worry that striping across racks will saturate the top-of-rack switches.

John Mahowald
  • Is the "bandwidth" used here "capacity" or "requirement"? I guess it's capacity, but I'm not sure. – gibmegucci Jul 29 '22 at 02:24
  • Oversubscribing intra- vs inter-rack bandwidth will remain tempting for cost reasons, whatever the speeds of a given switch. Whether a distributed application can "require" not oversubscribing depends on the design and budget (note what you cited about Facebook increasing inter-rack links). Capacity planning of a specific number of nodes and their networking to meet a given load is a separate question. – John Mahowald Jul 30 '22 at 22:27