How to handle a datacenter outage in a mutil-datacenter Cassandra Cluster

Question

Our application is running in a Cassandra Cluster of six nodes with two data centers.

Cluster information:

Cassandra version : 2.0.3

Snitch : GossipingPropertyFileSnith

Partitioner : Murmur3Partitioner

Each dc has three nodes.

Each dc has replication factor as 2.

Each node uses num_vnodes = 256. (all are virtual nodes)

DC1 is a live dc(local dc) which serves the data to the users currently. DC2 is just a back up dc(remote dc) which does not serve any data to the users. As we are planning for maintenance operations in DC1 alone, we are going to make remote dc DC2 to serve the users during the maintenance period.

During the outage, entire DC1 might be down for few days. Once the maintenance is done, we will again make DC1 to serve the data and DC2 for backing up. So we need to have up-to-date data in DC1 after outage. Our application will deal with huge data (few GBs) during the outage.

Before making DC1 down,

1) What are all the things (like commit-log settings,etc) need to be taken care in DC1 nodes

2) What are all the things (like hinted-handoff settings,etc) need to be taken care in DC2 nodes

During the outage,

3) When the entire DC1 is down, where the hints will be written (in any of the nodes of DC2?) and how to handle those hints?

After DC1 is up,

4) During the outage, replication might fail in DC1 nodes. How can we make/repair the DC1 with up-to-date data efficiently using DC2?

See also http://stackoverflow.com/questions/13647921/configuring-apache-cassandra-for-disaster-recovery — Raedwald, Dec 23 '15 at 07:34

score 1 · Accepted Answer · answered Dec 18 '15 at 11:59

making DC1 down

Before making DC1 down make sure you have run a full repair with nodetool repair.

This ensures that all data is propagated from the DC1 to DC2.

Then start killing nodes one by one form DC1. follow the step given here

Make sure your write consistency will be fulfilled with your DC2 itself or otherwise you will lose all your data.

if you are writing with consistency level ANY then it will guarantees you that write is durable.

During the outage

If you are using default cassnadra setup then the hints will only be store for 3 hours. Increasing this will give your machine unnecessarily overhead and I won't suggest you to keep 5 days hints.

You can configure hint time interval using the max_hint_window_in_ms property in the cassandra.yaml file.

I am not sure you should allow cassandra to write hints for DC1.

After DC1 is up

run a full repair with nodetool repair again to replicate data across your datacenter.

Writing with `CL.ANY` does not guarantee that the write reaches one of the replicas responsible for the data's partition key. If, say, all replicas for a partition are offline, `CL.ANY` allows another node (i.e. the coordinator) to store the write as a hinted handoff to be replayed when one of the responsible replicas becomes available -- IF one becomes available. Note that hinted handoffs _can expire_ (losing durability). The `CL.ONE` consistency level guarantees that your write made it to at least one of the responsible _replicas_. — William Price, Jan 07 '16 at 20:28

How to handle a datacenter outage in a mutil-datacenter Cassandra Cluster

1 Answers1