I have a Cassandra cluster (deployed on EC2 instances) in which every node is about to run out of disk space. If I add more instances to the cluster, will that increase the available disk space?

In other words, whenever we are running out of space, can we add more instances to the Cassandra cluster to increase the overall disk space?

If so, is that the right way to do it?

user2392631

2 Answers

In other words, whenever we are running out of space, can we add more instances to the Cassandra cluster to increase the overall disk space?

Yes, and yes.

Consider a 4-node cluster with a replication factor (RF) of 3 and 100 GiB of storage per node. Assume that one complete copy of the data initially takes up 60 GiB. With 4 nodes and an RF of 3, each node is responsible for 3/4 of the data, or 45 GiB.

Address      Load      Owns      Total
10.0.0.1     45.0 GiB  75.0%     100 GiB
10.0.0.2     45.0 GiB  75.0%     100 GiB
10.0.0.3     45.0 GiB  75.0%     100 GiB
10.0.0.4     45.0 GiB  75.0%     100 GiB
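
The per-node figure above can be sketched as a small calculation. This is only an illustration: `per_node_load` is a hypothetical helper name, and it assumes the data is spread evenly across all nodes, which real clusters only approximate.

```python
def per_node_load(footprint_gib: float, nodes: int, rf: int) -> float:
    """Approximate data held by each node.

    Assumes even distribution: the cluster stores `rf` copies of the
    footprint in total, shared equally among `nodes` nodes.
    """
    if rf > nodes:
        raise ValueError("replication factor cannot exceed the node count")
    return footprint_gib * rf / nodes


# Example from the text: 60 GiB footprint, 4 nodes, RF of 3, 100 GiB disks.
load = per_node_load(60.0, nodes=4, rf=3)
print(f"{load:.1f} GiB per node, {load / 100.0:.0%} of each disk")
```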

With size-tiered compaction (the default), you want to keep each node's disk usage under 50% of its capacity. This setup allows for that.

However, let's say the app team runs a big load overnight. We come in tomorrow morning, and find this:

Address      Load      Owns      Total
10.0.0.1     70.0 GiB  75.0%     100 GiB
10.0.0.2     70.0 GiB  75.0%     100 GiB
10.0.0.3     70.0 GiB  75.0%     100 GiB
10.0.0.4     70.0 GiB  75.0%     100 GiB

Essentially, a complete copy of the data has grown to about 93.3 GiB (70 GiB per node divided by the 75% each node owns). To bring each node's disk usage back below 50%, we will have to add more nodes.

But how many?

If we add a single node (keeping an RF of 3), each node becomes responsible for 3/5 (60%) of the data, which is 55.98 GiB. Close, but not quite there.

If we add two nodes, for a total of 6, each node is responsible for 50% of the data, which is 46.65 GiB. That does bring us back under 50% per node, so we should add at least two nodes.

After doing so, the cluster should look like this:

Address      Load       Owns      Total
10.0.0.1     46.65 GiB  50.0%     100 GiB
10.0.0.2     46.65 GiB  50.0%     100 GiB
10.0.0.3     46.65 GiB  50.0%     100 GiB
10.0.0.4     46.65 GiB  50.0%     100 GiB
10.0.0.5     46.65 GiB  50.0%     100 GiB
10.0.0.6     46.65 GiB  50.0%     100 GiB
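
The "how many nodes?" reasoning can also be turned around into a rough sizing sketch. `min_nodes` is a hypothetical helper, and it assumes even data distribution and a target of at most 50% disk usage per node. Because it works from the unrounded footprint (70 GiB * 4 / 3) rather than 93.3 GiB, its per-node figure comes out slightly above the 46.65 GiB shown in the table.

```python
import math


def min_nodes(footprint_gib: float, rf: int, disk_gib: float,
              max_fill: float = 0.5) -> int:
    """Smallest node count that keeps each node at or under `max_fill`.

    Each node holds roughly footprint * rf / nodes (even distribution
    assumed), so we need footprint * rf / nodes <= disk_gib * max_fill.
    """
    needed = math.ceil(footprint_gib * rf / (disk_gib * max_fill))
    # A cluster cannot have fewer nodes than its replication factor.
    return max(needed, rf)


# Observed after the overnight load: 4 nodes at 70 GiB each, RF of 3.
current_nodes, rf, disk_gib = 4, 3, 100.0
footprint = 70.0 * current_nodes / rf  # one full copy of the data

total = min_nodes(footprint, rf, disk_gib)
print(f"footprint: {footprint:.1f} GiB")
print(f"nodes needed: {total} (add {total - current_nodes})")
print(f"load per node afterwards: {footprint * rf / total:.2f} GiB")
```

Treat the result as a lower bound: it leaves no headroom for further growth, so in practice you may want to add more than the minimum.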

Note that bootstrapping new nodes only copies data to those nodes. It does not remove that data from the existing nodes. For that, you should run nodetool cleanup on each pre-existing node.

Aaron

You can add more nodes to the cluster and then rebalance it. That spreads your data across more nodes and should reduce the amount of data on each individual node, provided your data is partitioned well enough. At the same time, review your TTL values and gc_grace_seconds settings to make sure the space you are consuming is really warranted.

Gautam