1

I have a data modeling question for cases where data needs to be sorted by keys which can be modified. So , say we have a user table

{
   dept_id text,
   user_id text,
   user_name text,
   mod_date timestamp
   PRIMARY KEY (dept_id,user_id)
}

Now I can query cassandra to get all users by a dept_id.

What if I wanted to query to get all users in a dept, sorted by mod_date.

So, one way would be to

{
   dept_id text,
   mod_date timestamp,
   user_id text,
   user_name text,
   PRIMARY KEY (dept_id, mod_date,user_id)
}

But, mod_date changes every time user name is updated. So it can't be part of clustering key.

Attempt 1:
Don't update the row but instead create new record for every update.

So, say the record for user foo is like below {'dept_id1',TimeStamp1','user_id1','foo'} and then the name was changed to 'bar' and then to 'baz' . In that case we add another row to table, so it would look like

{'dept_id1',TimeStamp3','user_id1','baz'}
{'dept_id1',TimeStamp2','user_id1','bar'}
{'dept_id1',TimeStamp1','user_id1','foo'}

Now we can get all users in a dept, sorted by mod_date but it presents a different problem.

The data returned is duplicated

.

Attempt 2 : Add another column to identify the head record much like a linked list

{
   dept_id text,
   mod_date timestamp,
   user_id text,
   user_name text,
   next_record text
   PRIMARY KEY (dept_id,mod_date,user_id)
}

Every time an update happens it adds a row and also adds the PK of new record.

{'dept_id1',TimeStamp3','user_id1','baz','HEAD'}
{'dept_id1',TimeStamp2','user_id1','bar','dept_id1#TimeStamp3'}
{'dept_id1',TimeStamp1','user_id1','foo','dept_id1#TimeStamp2'}

and also add a secondary index to 'next_record' column.

Now I can support get all users in a dept, sorted by mod_date by

select * from USERS where dept_id=':dept' AND next_record='HEAD' order by mod_date.

But it looks fairly involved solution and perhaps I am missing something , a simpler solution ..

The other option is delete and insert but for high frequency changes I think Cassandra has issues with tombstones.

Suggestions/Feedback are welcome. Thanks !

factotum
  • 900
  • 10
  • 13
  • Which cassandra version are you using ? and another thing your primary key is `dept_id,user_id, mod_date` so your data is sorted by user_id then mod_date under a dept_id ? – Ashraful Islam Apr 06 '17 at 13:12
  • thanks. I updated the question to reflect the desired sort order. user_id is just noise in the question. I think hopefully the intent of question is clear. I am using Cassandra 2.2.x – factotum Apr 06 '17 at 13:48

1 Answers1

0

As I see, the simplest way is sorting users on application (client code) side. You use dept as a partition key, this means that all users in one dept can be handled one cassandra node, so there is no many users in one dept and this users can be sorted on application side fast enough.

Mikhail Baksheev
  • 1,394
  • 11
  • 13