1

Summary

Our team inherited this sequence generator implemented using Cassandra;

Table

CREATE TABLE IF NOT EXISTS sequences (
  id_name varchar,
  next_id bigint,
  instance_name varchar,
  PRIMARY KEY (id_name)
)WITH COMPRESSION = { ... };

GET_LOCK("UPDATE sequences USING TTL 10 set instance_name = ?  where id_name = ? IF instance_name = null", ConsistencyLevel.LOCAL_QUORUM), 

SELECT_SEQUENCE("SELECT next_id from sequences where id_name = ?",
            ConsistencyLevel.LOCAL_QUORUM)

UPDATE_SEQUENCE("UPDATE sequences SET next_id= ? where id_name= ? IF next_id= ?",ConsistencyLevel.LOCAL_QUORUM), 

REMOVE_LOCK("UPDATE sequences set instance_name = null where id_name = ? IF instance_name = ?", ConsistencyLevel.LOCAL_QUORUM);

(note: ConsistencyLevel was set to LOCAL_SERIAL in Java)

it was running fine until yesterday, we found two different java App nodes got the same sequence number

Time stamps when this happened

AppNode 1

getlock:           4:25:14.480 
UpdateSequence:    4:25:14.486 

AppNode 2

getlock:           4:25:14,489
UpdateSequence:    4:25:14,496

How can this happen? How can we find out what exactly happened?

user3277841
  • 339
  • 2
  • 6
  • 17

1 Answers1

1

A Possible scenario to go through

  • next_id can be read by any instance if the instance_name expired due to TTL
   SELECT_SEQUENCE("SELECT next_id from sequences where id_name = ?",
                   ConsistencyLevel.LOCAL_QUORUM)
  • Now assume the following sequence of operations

    1. instanceOne sets the lock ----> instance_name=instanceOne
    2. instanceOne reads the next_id value ----> next_id=value1
    3. instanceOne has some problem, it did not call UPDATE_SEQUENCE for 10s
    4. 10s has passed ----> instance_name=null
    5. instanceTwo set the lock -> instance_name=instanceTwo
    6. instanceOne reads the same next_id value -> next_id=value1
    7. Both instances tries to update the next_id with same next value. But based on the values at this point instanceTwo succeeds and it is a no operation for instanceOne
    8. Both instances tries to UNLOCK by removing the the instance_name .But based on the values at this point instanceTwo succeeds and it is a no operation for instanceOne
  • Not sure about the time stamps of the operations.
    1. One issue is how successfully they were co-related to the same value reads
    2. If the logs are from application, it can represent the query attempted time rather the query execution time in Cassandra
  • This could be a possible scenario to explain what happened, However the timestamp in the log doesn't seem to support it. 10s is a long time. We are trying to see if we can find any trace in the casandra's own log. thanks for the analysis! – user3277841 Jul 02 '20 at 16:09
  • If you were unable to find-out in Cassandra the timing of this operations, may be worth modifying your app log like ` issuing getlock` and `complete getlock` and the same for `UpdateSequence` method – Kavithakaran Kanapathippillai Jul 02 '20 at 16:14