We are proposing Cassandra as the database backend for a large archiving solution, where writes far outnumber reads. We are looking for input on Cassandra's replication and deployment strategies for this use case.
The choice of Cassandra was based on the following factors:
- High throughput for write operations: thousands of simultaneous writes per second
- Suitability for engineering data (mainly time-series data)
- High availability to support continuous telescope operations
- Tool support, e.g. analytics and reporting
Data Estimates
- 250 TB of growth per year over a 50-year system lifetime
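As a rough sanity check on these figures, the arithmetic below converts the yearly growth into a lifetime total and an approximate ingest rate. It uses only the two numbers above and assumes decimal units (1 PB = 1000 TB), a 365-day year and uniform ingest; replication overhead is not included.

```python
# Back-of-the-envelope sizing from the estimates above (raw data, before replication).
GROWTH_TB_PER_YEAR = 250
LIFETIME_YEARS = 50

lifetime_total_tb = GROWTH_TB_PER_YEAR * LIFETIME_YEARS
lifetime_total_pb = lifetime_total_tb / 1000   # assumes decimal units
daily_ingest_tb = GROWTH_TB_PER_YEAR / 365     # assumes uniform ingest over a 365-day year
weekly_ingest_tb = daily_ingest_tb * 7

print(f"Lifetime total: {lifetime_total_tb} TB ({lifetime_total_pb} PB)")
print(f"Daily ingest:   {daily_ingest_tb:.2f} TB")
print(f"Weekly ingest:  {weekly_ingest_tb:.2f} TB")
```

So the lifetime archive is on the order of 12.5 PB of raw data, and one week of ingest is a little under 5 TB.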
Use Case
We have two data centers, an Operations DC and an Analytics DC, to isolate the write and read workloads. The diagram at the end of this post depicts the proposed architecture. Due to storage constraints, we can't keep the data generated over the system's lifetime in the Operations DC. Hence, we are planning to move data from the Operations DC to the Analytics DC according to a defined policy (let's say after 1 week).
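For reference, Cassandra declares replication per keyspace and per data center through `NetworkTopologyStrategy`. The sketch below only builds such statements as strings. The keyspace names (`ops_recent`, `archive`), the data center names (`operations`, `analytics`) and the replication factors are all hypothetical, and whether a layout like this can satisfy the move policy is exactly what the questions below ask.

```python
from typing import Mapping


def create_keyspace_cql(keyspace: str, replicas_per_dc: Mapping[str, int]) -> str:
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to the number of replicas kept
    there. The names must match the data center names the cluster reports.
    """
    dc_part = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {dc_part}}};"
    )


# Hypothetical layout: one keyspace with replicas only in the Operations DC,
# and one with replicas only in the Analytics DC.
print(create_keyspace_cql("ops_recent", {"operations": 3}))
print(create_keyspace_cql("archive", {"analytics": 3}))
```

A keyspace that lists both data centers in its replication map keeps replicas in both, so writes and deletes to it apply to both.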
Questions
- Is one-way replication between data centers possible in Cassandra? Data from the Operations DC should move into the Analytics DC, but data stored in the Analytics DC after processing must not be replicated back to the Operations DC.
- Does Cassandra provide control over what gets replicated? We don't want the two DCs to be in sync; we want to configure what gets replicated (moved, actually) into the Analytics DC. Is this inherently possible with Cassandra? For example, can I specify that only the last week's data is replicated from the Operations DC to the Analytics DC?
- We are planning to use Cassandra's built-in time-to-live (TTL) feature to delete data from the Operations DC only. Data deleted from the Operations DC should not be deleted from the Analytics DC. How can we prevent deletions from being replicated?
- I have read that a single Cassandra node can handle up to 2-3 TB of data. Any documented references to larger Cassandra implementations would help.
- How many Cassandra nodes should be deployed to handle this growth, and what deployment strategy is recommended?
- Performance considerations: although storage at the Operations DC will be limited (3-7 days of data, about 5-10 TB), the data at the Analytics DC is cumulative and keeps growing over time. Will database growth at the Analytics DC affect replication and degrade the performance of the Operations DC?
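To frame the node-count question, here is a minimal sizing sketch that combines the figures quoted in this post. The replication factor of 3 is an assumption, not a requirement stated above; the 5-10 TB Operations figure is treated as raw data; and compression, compaction headroom and free-space margins are ignored, so real numbers would differ.

```python
import math


def nodes_needed(raw_tb: float, replication_factor: int, tb_per_node: float) -> int:
    """Nodes required to hold raw_tb of data with the given replication factor."""
    return math.ceil(raw_tb * replication_factor / tb_per_node)


RF = 3  # assumption: replication factor per data center

scenarios = [
    ("Operations DC (10 TB)", 10),
    ("Analytics DC after 1 year (250 TB)", 250),
    ("Analytics DC after 50 years (12500 TB)", 12500),
]

for label, raw_tb in scenarios:
    at_3_tb = nodes_needed(raw_tb, RF, 3.0)  # upper end of the quoted 2-3 TB per node
    at_2_tb = nodes_needed(raw_tb, RF, 2.0)  # lower end of the quoted 2-3 TB per node
    print(f"{label}: {at_3_tb} to {at_2_tb} nodes")
```

Under these assumptions the Analytics DC grows by a few hundred nodes per year, which is why references to deployments with a higher per-node density matter for this design.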
The purpose here is to find out whether Cassandra's built-in features can support the above requirements. I'm aware of the most obvious alternative: have no replication between the two DCs, and instead dump the last week's data from the Operations DC and move it to the Analytics DC.
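For comparison, the scheduling side of that dump-and-move alternative could look roughly like the sketch below. It assumes the data is bucketed by calendar day and that the job remembers the last day it exported; both are assumptions made for illustration, and `export_days` is a hypothetical helper, not part of any Cassandra tooling.

```python
from datetime import date, timedelta
from typing import List


def export_days(last_exported: date, today: date) -> List[date]:
    """Return the complete days (strictly before today) not yet exported."""
    days = []
    day = last_exported + timedelta(days=1)
    while day < today:
        days.append(day)
        day += timedelta(days=1)
    return days


# Example: the last export covered 8 January and the job runs on 15 January.
pending = export_days(date(2024, 1, 8), date(2024, 1, 15))
for day in pending:
    # Placeholder for the real work: dump this day's data from the
    # Operations DC and load it into the Analytics DC.
    print(f"export bucket {day.isoformat()}")
```

Exporting only complete days keeps each bucket from being moved twice, provided the TTL on the Operations DC is longer than the interval between exports.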