I am a cassandra newbie trying to see how I can model our current sql data in cassandra. The database stores document metadata that includes document_id, last_modified_time, size_in_bytes among a host of other data, and the number of documents can be arbitrarily large and hence we are looking for a scalable solution for storage and query.
There is a requirement of 2 range queries
- select all docs where last_modified_time >=x and last_modified_time
- select all docs where size >= x and size <= y
And also a set of queries where docs needs to be grouped by specific metadata e.g.
- select all docs where user in (x,y,z)
What is the best practice of designing the data model based on these queries?
My initial thought is to have a table (in Cassandra 2.0, CQL 3.0) with the last_mod_time as the secondary index as follows
create table t_document (
document_id bigint,
last_mod_time bigint ,
size bigint,
user text,
....
primary key (document_id, last_mod_time)
}
This should take care of query 1.
Do I need to create another table with the primary key as (document_id, size) for the query 2? Or can I just add the size as the third item in the primary key of the same table e.g. (document_id, last_mod_time, size). But in this case will the second query work without using the last_mod_time in the where clause?
For the query 3, which is all docs for one or more users, is it the best practice to create a t_user_doc table where the primary key is (user, doc_id)? Or a better approach is to create a secondary index on the user on the same t_document table?
Thanks for any help.