How to retrieve only the information that got changed from Cassandra?

Question

I am designing a Cassandra column family schema for the use case below and am not sure what the best design is. I will be using CQL with the DataStax Java driver.

Below is my use case and the sample schema that I have designed so far:

SCHEMA_ID       RECORD_NAME               SCHEMA_VALUE              TIMESTAMP
1                  ABC                     some value                 t1
2                  ABC                     some_other_value           t2
3                  DEF                     some value again           t3
4                  DEF                     some other value           t4
5                  GHI                     some new value             t5
6                  IOP                     some values again          t6

What I need from the above table is the following:

When my application starts for the first time, it will ask for everything in the above table.
Then, every 5 or 10 minutes, a background thread will check this table and ask only for what has changed (the full row if anything in that row changed). That is why I am using a timestamp as one of the columns.

But I am not sure how to design the query pattern so that both use cases are satisfied easily, or what the proper table design for this is. I am thinking of using SCHEMA_ID as the primary key.

I will be using CQL and Datastax Java driver for this..

Update:-

If I use something like this, is there any problem with this approach?

CREATE TABLE TEST (SCHEMA_ID TEXT, RECORD_NAME TEXT, SCHEMA_VALUE TEXT, LAST_MODIFIED_DATE TIMESTAMP, PRIMARY KEY (SCHEMA_ID));

INSERT INTO TEST (SCHEMA_ID, RECORD_NAME, SCHEMA_VALUE, LAST_MODIFIED_DATE) VALUES ('1', 't26',  'SOME_VALUE', 1382655211694);

I ask because in my use case I don't want anybody to insert the same SCHEMA_ID more than once. SCHEMA_ID should be unique whenever we insert a new row into this table. So with your example (@omnibear), is it possible that somebody inserts the same SCHEMA_ID twice? Am I correct?

Also, regarding the type column you added as an extra column: that column can be record_name in my example.

Off the top of my head: it is not necessary to satisfy all use cases with 1 table. One of the principles of NoSQL storage is to embrace redundancy whenever it makes sense. You are working in a distributed environment so storage is not as costly. If you can solve the issue by creating two instead of one table - just do it :-) — John, Oct 28 '13 at 09:24
Thanks omnibear for the suggestion.. But I guess, I can achieve my second question answer with my current table architecture? Right? If yes, then how can I do that? Any thoughts? — AKIWEB, Oct 28 '13 at 17:04

John · Accepted Answer · 2013-10-30T08:57:32.343

Regarding 1) Cassandra is designed for write-heavy workloads with lots of data spread over multiple nodes. Retrieving ALL data from this kind of set-up is daring, since a single client may have to handle a huge amount of data. A better approach is to use pagination, which is natively supported in Cassandra 2.0.

Regarding 2) The point is that partition keys only support EQ or IN queries. For LT or GT (< / >) you use column keys (clustering columns). So if it makes sense to group your entries by some ID like "type", you can use this as your partition key and a timeuuid as a column key. This lets you query for all entries newer than X like so:

create table test 
  (type int, SCHEMA_ID int, RECORD_NAME text, 
  SCHEMA_VALUE text, TIMESTAMP timeuuid, 
  primary key (type, timestamp));

select * from test where type IN (0,1,2,3) and timestamp > 58e0a7d7-eebc-11d8-9669-0800200c9a66;
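
For the polling use case, the bound X has to be derived from the time of the last check. CQL offers a built-in minTimeuuid() function for this. If the bound is computed on the client instead, the following is a minimal Java sketch that uses only java.util.UUID. The class and method names are hypothetical, and the constant used for the clock-sequence/node bits is an assumption modelled on how the DataStax driver's UUIDs.startOf helper builds a lower bound, so treat it as an illustration rather than a replacement for that helper.

```java
import java.util.UUID;

// Hypothetical helper: builds a version-1 (time-based) UUID usable as a
// lower bound in "timeuuid_column > ?" style queries.
final class TimeUuidBounds {
    // Number of 100-ns intervals between 1582-10-15 (UUID epoch)
    // and 1970-01-01 (Unix epoch).
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    // Assumption: clock-sequence/node bits that sort lowest under
    // Cassandra's timeuuid ordering (same idea as UUIDs.startOf).
    private static final long MIN_CLOCK_SEQ_AND_NODE = 0x8080808080808080L;

    static UUID lowerBound(long unixMillis) {
        long ts = unixMillis * 10_000L + UUID_EPOCH_OFFSET;
        long msb = 0L;
        msb |= (ts & 0x00000000FFFFFFFFL) << 32;  // time_low
        msb |= (ts & 0x0000FFFF00000000L) >>> 16; // time_mid
        msb |= (ts & 0x0FFF000000000000L) >>> 48; // time_hi
        msb |= 0x0000000000001000L;               // version 1
        return new UUID(msb, MIN_CLOCK_SEQ_AND_NODE);
    }

    public static void main(String[] args) {
        // Time of the previous poll, in Unix milliseconds (example value).
        long lastCheck = 1383123355800L;
        System.out.println(lowerBound(lastCheck));
    }
}
```

A background thread would remember the time of its previous poll, turn it into such a bound, and pass the resulting UUID as the value on the right-hand side of the > comparison.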

Update:

You asked:

somebody can insert same SCHEMA_ID twice? Am I correct?

Yes, you can always make an insert with an existing primary key. The values at that primary key will be updated. Therefore, to preserve uniqueness, a UUID is often used in the primary key, for instance, timeuuid. It is a unique value containing a timestamp and the MAC address of the client. There is excellent documentation on this topic.

General advice:

Write down your queries first, then design your model. (Use case!)
Your queries define your data model which in turn is primarily defined by your primary keys.

So, in your case, I'd just adapt my schema above, like so:

CREATE TABLE TEST (SCHEMA_ID TEXT, RECORD_NAME TEXT, SCHEMA_VALUE TEXT,   
LAST_MODIFIED_DATE TIMEUUID, PRIMARY KEY (RECORD_NAME, LAST_MODIFIED_DATE));

Which allows this query:

select * from test where RECORD_NAME IN ('componentA','componentB')
  and LAST_MODIFIED_DATE > 1688f180-4141-11e3-aa6e-0800200c9a66;

The UUID corresponds to Wednesday, October 30, 2013 8:55:55 AM GMT, so you would fetch everything modified after that time.
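
To check which point in time a given timeuuid encodes, the timestamp can be extracted with the standard java.util.UUID class. This is a minimal sketch; the class and method names are hypothetical, and it assumes the input is a version-1 UUID (UUID.timestamp() throws UnsupportedOperationException otherwise).

```java
import java.time.Instant;
import java.util.UUID;

// Hypothetical helper: converts a version-1 (time-based) UUID to an Instant.
final class TimeUuidDecoder {
    // Number of 100-ns intervals between 1582-10-15 (UUID epoch)
    // and 1970-01-01 (Unix epoch).
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    static Instant toInstant(UUID timeUuid) {
        // UUID.timestamp() returns 100-ns units since the UUID epoch;
        // shift to the Unix epoch and convert to milliseconds.
        long unixMillis = (timeUuid.timestamp() - UUID_EPOCH_OFFSET) / 10_000L;
        return Instant.ofEpochMilli(unixMillis);
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("1688f180-4141-11e3-aa6e-0800200c9a66");
        System.out.println(toInstant(id)); // prints 2013-10-30T08:55:55.800Z
    }
}
```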

Thanks a lot omnibear for the suggestion.. It makes sense now.. But I have one problem with this approach.. I have updated my question with my confusion.. — AKIWEB, Oct 29 '13 at 15:57
