3

I am new to Cassandra and am trying to understand how it works. Say I write to a number of nodes. My understanding is that the hash of the key decides which node owns the data, and then replication happens. When reading, the hash of the key determines which node has the data, and that node responds. My question: if reads and writes go to the same set of nodes, which always have the data, how does read inconsistency occur, and how can Cassandra return stale data?

user3276247

2 Answers

5

To tune consistency, Cassandra lets you set the consistency level on a per-query basis.

Now for your question: let's assume the consistency level is ONE and the replication factor is 3.

During a write, the coordinator sends the write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment for the write to be considered successful. Success means that the data was written to the commit log and the memtable.

For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
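The write path described above can be sketched as a toy model. This is only an illustration, not Cassandra's implementation: the `Replica` class, `coordinator_write`, and the key `user:42` are hypothetical names, and real coordinators also handle timeouts and hints, which are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one replica node."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        """Apply the write if the node is up; the return value is the acknowledgment."""
        if not self.up:
            return False
        self.data[key] = (value, timestamp)
        return True


def coordinator_write(replicas, key, value, timestamp, required_acks):
    """Send the write to every replica, then compare the acks with the consistency level."""
    acks = sum(replica.write(key, value, timestamp) for replica in replicas)
    return acks >= required_acks


# Replication factor 3, with two replicas down at the time of the write.
replicas = [Replica("n1"), Replica("n2", up=False), Replica("n3", up=False)]

# Consistency level ONE: a single acknowledgment is enough.
ok = coordinator_write(replicas, "user:42", "alice", timestamp=1, required_acks=1)
print(ok)  # True
print([r.name for r in replicas if "user:42" in r.data])  # ['n1']
```

The write is reported as successful even though only `n1` holds the row, which is the gap the repair mechanisms have to close later.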

By default, hints are saved for three hours after a replica fails because if the replica is down longer than that, it is likely permanently dead. You can configure this interval of time using the max_hint_window_in_ms property in the cassandra.yaml file. If the node recovers after the save time has elapsed, run a repair to re-replicate the data written during the down time.
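As a small worked example of the three-hour window, the check below converts it to milliseconds, the unit the property name suggests. The helper `needs_manual_repair` is a hypothetical name used only for illustration, not a Cassandra API.

```python
# Three hours expressed in milliseconds, matching the default described above.
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


def needs_manual_repair(downtime_ms, max_hint_window_ms=MAX_HINT_WINDOW_MS):
    """Return True when the outage outlasted the hint window, so hints alone cannot cover it."""
    return downtime_ms > max_hint_window_ms


print(MAX_HINT_WINDOW_MS)  # 10800000
print(needs_manual_repair(30 * 60 * 1000))  # False: a 30 minute outage is covered by hints
print(needs_manual_repair(5 * 60 * 60 * 1000))  # True: a 5 hour outage needs a repair
```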

When a read request is performed, the coordinator sends it to the replicas that can currently respond the fastest, so with a consistency level of ONE it may go to any one of the three replicas.

Now imagine that the data has not yet reached the third replica and the read happens to select that replica (the chance is small): you then get inconsistent data.

This scenario assumes all nodes are up. If a node was down and read repair has not yet run after it comes back up, that adds to the problem.
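A minimal sketch of the stale read, assuming replica `n3` has not yet received the latest write. The names `read_one`, `overlap_guaranteed`, and the replica labels are hypothetical. The second function encodes the general rule that a read is guaranteed to touch a replica holding an acknowledged write only when read CL + write CL > replication factor.

```python
def read_one(replica_states, chosen):
    """Return what a consistency level ONE read sees from the single replica chosen."""
    return replica_states[chosen]


def overlap_guaranteed(read_cl, write_cl, replication_factor):
    """True when every read set must intersect every acknowledged write set."""
    return read_cl + write_cl > replication_factor


# (value, write timestamp) per replica; n3 still holds the older version.
replica_states = {"n1": ("new", 2), "n2": ("new", 2), "n3": ("old", 1)}

print(read_one(replica_states, "n3"))  # ('old', 1): stale data
print(overlap_guaranteed(1, 1, 3))  # False: ONE + ONE with replication factor 3
print(overlap_guaranteed(2, 2, 3))  # True: QUORUM + QUORUM with replication factor 3
```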

READ With Different CONSISTENCY LEVEL

READ Request in Cassandra

undefined_variable
  • The data is written to all 3 replicas (not just the first) from the coordinator at the same time, it just acknowledges the write once the requested consistency has been met. – Chris Lohfink Jun 28 '17 at 19:56
  • If any of the nodes are down, the coordinator will write a hint. Once the node comes back up, it will stream the hints to it so that it becomes consistent. Read repair is just another mechanism. A node being down doesn't mean that it becomes inconsistent – Chris Lohfink Jun 28 '17 at 20:01
  • When a read occurs, it will send a DATA request to the fastest replica and a DIGEST request to the other replicas. Once it gets enough responses to meet the CL, it will return to the client. It will compare the data returned from the node to the digests of the others. If they mismatch, it will send a mutation to fix the inconsistency. There are blocking and async read repairs, depending on which (DATA or DIGEST) is more recent. This may occur even after the client got its read response. Speculative retry is available if the DATA requests are too slow, in which case it sends a DATA request to other replicas. – Chris Lohfink Jun 28 '17 at 20:03
  • @ChrisLohfink you are right about DIGEST and HINT, but the HINT max window is 3 hours by default, and DIGEST will be used to do `read_repair` based on `read_repair_chance`. In the case of `CONSISTENCY ONE`, results will be returned to the coordinator as soon as 1 of the replicas responds. – undefined_variable Jun 29 '17 at 05:38
  • Thanks. The way I understood from the comments and the link attached in the answer, is that if the CONSISTENCY level is that of QUORUM, read requests will be consistent and up-to-date with the latest data. let me know if this is the correct understanding. – user3276247 Jun 29 '17 at 06:51
  • Yes, if you use a CONSISTENCY of QUORUM for both reads and writes, your read requests will be consistent and up to date – undefined_variable Jun 29 '17 at 06:55
  • @ChrisLohfink Thanks for your suggestion.... Changed answer accordingly... Thanks again – undefined_variable Jun 29 '17 at 12:56
0

Consider a scenario where the CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request goes to all 3 replicas as usual. If the write fails on 2 replicas and succeeds on 1, Cassandra returns a failure. Since Cassandra does not roll back, the record continues to exist on the replica where the write succeeded. Now, when a read comes in with CL=QUORUM, the read request is forwarded to 2 replicas, and if one of them is the replica where the write succeeded, Cassandra returns the new record because it has the latest timestamp. From the client's perspective, though, this record was never written, because Cassandra reported a failure for the write.
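The scenario can be sketched with a toy last-write-wins reconciliation. This is an illustration under stated assumptions: `quorum`, `reconcile`, the values `v1`/`v2`, and the timestamps are all hypothetical, and the real read path also involves digests and read repair.

```python
def quorum(replication_factor):
    """Number of replicas that make up a quorum."""
    return replication_factor // 2 + 1


def reconcile(responses):
    """Pick the (value, timestamp) response with the highest timestamp (last write wins)."""
    return max(responses, key=lambda response: response[1])


# State after the failed write: only n1 applied the new value.
n1 = ("v2", 200)
n2 = ("v1", 100)
n3 = ("v1", 100)

acks = 1
print(acks >= quorum(3))  # False: the client is told the write failed
print(reconcile([n1, n2]))  # ('v2', 200): this quorum read sees the "failed" write
print(reconcile([n2, n3]))  # ('v1', 100): this quorum read does not
```

Which of the two results a client gets depends on which two replicas the coordinator happens to contact.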

Ologn