How can I version my data with Cassandra?

Question

I want to use the key-value pair feature of Cassandra. Until now, I have been using Kyotocabinet but it does not support multiple writes and hence, I want to use Cassandra for versioning my tabular data.

Roll No, Name, Age, Sex
14BCE1008, Aviral, 22, Male
14BCE1007, Shantanu, 22, Male

The above data is tabular(csv). It's version 1. Next is version 2:

Roll No, Name, Age, Sex
14BCE1008, Aviral, 22, Male
14BCE1007, Shantanu, 22, Male
14BCE1209, Piyush, 22, Male

Hence, I would call the above version as version 2 with the following diff: insert_patch: 14BCE1209 as key(PK) and 14BCE1209, Piyush, 22, Male as value. I am familiar with the creation of the table but unable to figure out the versioning part.

@AlexOtt I didn't get you, are you asking me the version number prior inserting? Nope. I will resolve it from the current tables(existing) based on my nomenclature, at least thats the plan but whatver you suggest, feel free. — aviral sanjay, Jan 15 '19 at 09:53

score 5 · Accepted Answer · answered Jan 15 '19 at 10:16

To have multiple versions of data in your table if you use composite primary key instead of primary key consisting of one field.

So table definition could look as following (if you "know" the version number prior inserting the data):

create table test( 
  id text,
  version int,
  payload text,
  primary key (id, version)
) with clustering order by (version desc);

and inserting data as:

insert into test (id, version, payload) values ('14BCE1209', 1, '....');
insert into test (id, version, payload) values ('14BCE1209', 2, '....');

to select the latest value for given key you can use LIMIT 1:

SELECT * from test where id = '14BCE1209' LIMIT 1;

and to select latest versions for all partitions (not recommended, just for example - need a special approach for effective processing):

SELECT * from test PER PARTITION LIMIT 1;

But this will work only for cases when you know version in advance. If you don't know, then you can use timeuuid type for version instead of the int:

create table test( 
  id text,
  version timeuuid,
  payload text,
  primary key (id, version)
) with clustering order by (version desc);

and inserting data as (instead of now() you can use current timestamp generated from your code):

insert into test (id, version, payload) values ('14BCE1209', now(), '....');
insert into test (id, version, payload) values ('14BCE1209', now(), '....');

and select will work the same as above...

Thanks. But why did you use just one table for all versions? Can't it be one table per version? — aviral sanjay, Jan 15 '19 at 10:33
You can have multiple tables for different versions, but if you have too many versions (> 500) this will affect performance as every table requires an additional memory for metadata, etc. Plus, in your code you'll need to handle the querying of different tables for different versions. But then I don't understand the question... — Alex Ott, Jan 15 '19 at 10:54

How can I version my data with Cassandra?

1 Answers1