12

I am considering the design of a Cassandra cluster.

The use case would be storing large rows of tiny samples of time series data (using KairosDB); the data will be almost immutable (very rare deletes, no updates). That part is working very well.

However, after several years the data will be quite large (it will reach a maximum size of several hundred terabytes, and over one petabyte once the replication factor is taken into account).

I am aware of the advice not to put more than 5 TB of data on a Cassandra node because of the high I/O load during compactions and repairs (which is apparently already quite high for spinning disks). Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating whether it would be workable to run high-density servers on spinning disks (e.g. at least 10 TB or 20 TB per node using spinning disks in RAID10 or JBOD; the servers would have plenty of CPU and RAM, so the system would be I/O bound).

The number of reads/writes per second in Cassandra will be manageable by a small cluster without any stress. I should also mention that this is not a high-performance transactional system but a datastore for storage, retrieval and some analysis, and that the data will be almost immutable; so even if a compaction or a repair/reconstruction takes several days on several servers at the same time, it will probably not be an issue at all.

I am wondering whether anyone has experience to share with high server density on spinning disks, and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).

Thanks in advance for your feedback.

Best regards.

Loic

3 Answers

21

The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.

In the days pre-vnodes, if you had a 20T node that died and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and take hours or days to restore the down node. In that time, you're running with reduced redundancy, which is a real risk if you value your data.
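To put rough numbers on "hours or days", here is an editorial back-of-the-envelope sketch (the effective streaming throughputs are assumptions, not figures from this answer):

    # Rough restore-time estimate for a 20T node; throughput values are assumptions.
    NODE_SIZE_TB = 20

    def restore_hours(effective_gbit_per_s):
        """Hours needed to stream NODE_SIZE_TB at a given effective throughput."""
        size_bits = NODE_SIZE_TB * 1e12 * 8               # terabytes -> bits
        return size_bits / (effective_gbit_per_s * 1e9) / 3600

    for gbit in (1, 2.5, 10):
        print(f"{gbit:>4} Gb/s effective -> ~{restore_hours(gbit):.0f} h")
    # ~44 h at 1 Gb/s, ~18 h at 2.5 Gb/s, ~4 h at 10 Gb/s -- before any
    # retries caused by streaming failures or compaction backlog.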

One of the reasons vnodes were appreciated by many people is that they distribute load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without the bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though, again, it's getting better and better).

Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.

Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.

Jeff Jirsa
  • Thanks, your answer is very nice and the best so far. The problem with having five 4T nodes is that I will eventually need to store (after years of operation, so things will change until then) more than 1PB of data (considering the replication factor). This means 250+ nodes instead of 50+, and the overhead, space requirements, power, networking, air conditioning, etc. that come with such a large cluster. We have time to figure this out, but I am almost sure that the system would perform fine with 10 or 20T per node using vnodes (and possibly SSDs). – Loic Jul 29 '15 at 08:18
  • That's a reasonable desire. The risks remain the same. Rebuilding a 20T node takes a significant amount of time and is prone to streaming failures. – Jeff Jirsa Jul 30 '15 at 18:30
  • Finally, it's worth noting that data on disk - even cold data - has memory overhead. Bloom filters cost approximately 1-2GB of RAM per billion rows. Compression metadata is 1-3GB per TB of data. That means for 20T of data, you're looking at 20-60GB of RAM just for the compression metadata (off heap). Your "few, large" nodes are going to have to scale up in RAM, so your actual real cost isn't going to be significantly reduced vs using smaller nodes (a rough calculation follows these comments). – Jeff Jirsa Jul 30 '15 at 22:38
  • Thanks for this comment, this is quite interesting. Actually it would be significantly reduced, especially when considering the overhead of maintaining a data center four times bigger, taking four times more space and almost four times more power and air conditioning. Moreover, the cost of 60GB of ECC registered RAM is nothing compared to multiplying the nodes, even with the cheapest crap on the market. Where did you find the information about the bloom filter and compression data overhead? I did not manage to find this information anywhere and it's very useful for sizing the hardware. – Loic Jul 31 '15 at 05:00
  • I didn't "find" the info, I just happen to know it (I've been running in production since 0.6, and I'm a cassandra MVP and speaking at Summit in 2015). However, since you asked, I found the docs that support my memory: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_off_heap_c.html – Jeff Jirsa Aug 03 '15 at 05:30
4

I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:

  1. See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.

  2. Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.

  3. Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data (a sketch of this, together with the repair options from points 1 and 4, follows this list).

  4. There are other repair options that might help, for example I've found that the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance-killing full repairs on everything.
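A minimal sketch of what points 1, 3 and 4 could look like in practice (the keyspace/table name and the DateTiered option values are assumptions, not KairosDB's actual schema; repair flags are as of Cassandra 2.1):

    # Switch a time-series table to DateTieredCompactionStrategy.
    # Keyspace/table name and option values are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])     # any contact point
    session = cluster.connect()

    session.execute("""
        ALTER TABLE metrics.data_points
        WITH compaction = {
            'class': 'DateTieredCompactionStrategy',
            'base_time_seconds': '3600',
            'max_sstable_age_days': '30'
        }
    """)

    # Repairs would then be scheduled from cron/ops tooling, e.g.:
    #   nodetool repair -inc           # incremental repair (point 1)
    #   nodetool repair -local -inc    # restrict repair to the local DC (point 4)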

I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.

If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
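For the fill/write test, a minimal synthetic load-generator sketch along these lines could be adapted (the keyspace, table and schema here are assumptions standing in for whatever KairosDB actually creates):

    # Synthetic write load for the fill test described above.
    # Assumed schema (not KairosDB's real one):
    #   CREATE TABLE loadtest.samples (
    #       series_id uuid, day text, ts double, value double,
    #       PRIMARY KEY ((series_id, day), ts));
    import time, random, uuid
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect('loadtest')

    insert = session.prepare(
        "INSERT INTO samples (series_id, day, ts, value) VALUES (?, ?, ?, ?)")

    series = [uuid.uuid4() for _ in range(1000)]   # synthetic series ids
    day = '2015-07-28'

    while True:
        start = time.time()
        batch = BatchStatement()
        sid = random.choice(series)                # keep each batch in one partition
        for _ in range(100):                       # 100 tiny samples per batch
            batch.add(insert, (sid, day, time.time(), random.random()))
        session.execute(batch)
        # A sustained rise in per-batch latency usually points at compaction
        # or flush pressure on the node under test.
        print(f"batch took {time.time() - start:.3f}s")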

Hope that helps.

Jim Meyer
  • Thanks for your answer. Setting up the test would be great, but to fill up a 20TB node I would need months (very small samples; I can insert billions of them every day, but each is just a few bytes, so I estimated I would need around 100 days of processing). BTW, RAID5 is highly discouraged and considered unsafe nowadays, because volumes are so large on modern spinning disks that the probability of a second failure during reconstruction is too high. RAID6 will become unsafe in a few years. But why would I need RAID when Cassandra now supports JBOD quite well? – Loic Jul 28 '15 at 15:56
  • For a write test you could randomly generate data points and insert them in batches if there was only one test node. I didn't mean to sound like I was recommending RAID5; I was just mentioning that it works. I was forced to use it since it was set up for other applications besides Cassandra. Probably JBOD is a better option if your machines are dedicated to Cassandra. – Jim Meyer Jul 28 '15 at 17:06
2

I would recommend thinking about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1], which consists of a partition key plus one or more clustering columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra partitioner that you use, see cassandra.yaml).

For example, you could partition your data by the device that generates it (Pattern 1 in [2]) or by a period of time (e.g., per day), as shown in Pattern 2 in [2].

You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
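A minimal sketch of such a time-bucketed composite key (keyspace, table and column names here are illustrative assumptions, not KairosDB's actual schema):

    # Time-bucketed composite partition key, in the spirit of Pattern 2 in [2].
    # Names and the per-day bucket are illustrative assumptions.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS metrics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS metrics.samples_by_day (
            device_id text,
            day       text,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((device_id, day), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)
    # Each (device_id, day) pair is its own partition, so no partition grows
    # toward the ~2 billion cell limit, and a day's data for one device stays
    # together for efficient range scans.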

[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/

[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling

[3] http://wiki.apache.org/cassandra/CassandraLimitations

Markus Klems
  • Thanks for your answer. Actually that is not my question. The design is already done and efficient, allowing up to 3 weeks of data to be stored per partition key with smart indexing. My question is about Cassandra's ability to handle high-density storage nodes: knowing that this is not meant to be a high-performance transactional system, and that the data will be mostly immutable, has anyone managed to run nodes with 10 to 20 TB (or more) of data without compactions, repairs, reconstructions, etc. ending up monopolizing the I/O on the nodes at some point? – Loic Jul 24 '15 at 10:13