Cassandra lightweight transaction based sequence generator consistency failure?

Question

Summary

Our team inherited this sequence generator implemented using Cassandra;

Table

CREATE TABLE IF NOT EXISTS sequences (
  id_name varchar,
  next_id bigint,
  instance_name varchar,
  PRIMARY KEY (id_name)
)WITH COMPRESSION = { ... };

GET_LOCK("UPDATE sequences USING TTL 10 set instance_name = ?  where id_name = ? IF instance_name = null", ConsistencyLevel.LOCAL_QUORUM), 

SELECT_SEQUENCE("SELECT next_id from sequences where id_name = ?",
            ConsistencyLevel.LOCAL_QUORUM)

UPDATE_SEQUENCE("UPDATE sequences SET next_id= ? where id_name= ? IF next_id= ?",ConsistencyLevel.LOCAL_QUORUM), 

REMOVE_LOCK("UPDATE sequences set instance_name = null where id_name = ? IF instance_name = ?", ConsistencyLevel.LOCAL_QUORUM);

(note: ConsistencyLevel was set to LOCAL_SERIAL in Java)

it was running fine until yesterday, we found two different java App nodes got the same sequence number

Time stamps when this happened

AppNode 1

getlock:           4:25:14.480 
UpdateSequence:    4:25:14.486 

AppNode 2

getlock:           4:25:14,489
UpdateSequence:    4:25:14,496

How can this happen? How can we find out what exactly happened?

1) At what point are you reading the sequence? 2) Are the both calls to same datacenter? 3)where do you get these timestamps .ie is it time of request or time of response in app node or from cassandra logs — Kavithakaran Kanapathippillai, Jun 27 '20 at 07:21
It reads the counter after the GET_LOCK and before the UPDATE_SEQUENCE, both calls went to the same data center. timestamps are from Java App's log — user3277841, Jun 27 '20 at 18:18

Kavithakaran Kanapathippillai · Answer 1 · 2020-07-01T10:45:47.780

A Possible scenario to go through

next_id can be read by any instance if the instance_name expired due to TTL

   SELECT_SEQUENCE("SELECT next_id from sequences where id_name = ?",
                   ConsistencyLevel.LOCAL_QUORUM)

Now assume the following sequence of operations
1. instanceOne sets the lock ----> instance_name=instanceOne
2. instanceOne reads the next_id value ----> next_id=value1
3. instanceOne has some problem, it did not call UPDATE_SEQUENCE for 10s
4. 10s has passed ----> instance_name=null
5. instanceTwo set the lock -> instance_name=instanceTwo
6. instanceOne reads the same next_id value -> next_id=value1
7. Both instances tries to update the next_id with same next value. But based on the values at this point instanceTwo succeeds and it is a no operation for instanceOne
8. Both instances tries to UNLOCK by removing the the instance_name .But based on the values at this point instanceTwo succeeds and it is a no operation for instanceOne

Not sure about the time stamps of the operations.
1. One issue is how successfully they were co-related to the same value reads
2. If the logs are from application, it can represent the query attempted time rather the query execution time in Cassandra

This could be a possible scenario to explain what happened, However the timestamp in the log doesn't seem to support it. 10s is a long time. We are trying to see if we can find any trace in the casandra's own log. thanks for the analysis! — user3277841, Jul 02 '20 at 16:09
If you were unable to find-out in Cassandra the timing of this operations, may be worth modifying your app log like ` issuing getlock` and `complete getlock` and the same for `UpdateSequence` method — Kavithakaran Kanapathippillai, Jul 02 '20 at 16:14

Cassandra lightweight transaction based sequence generator consistency failure?

1 Answers1

A Possible scenario to go through