0

I am a cassandra newbie trying to see how I can model our current sql data in cassandra. The database stores document metadata that includes document_id, last_modified_time, size_in_bytes among a host of other data, and the number of documents can be arbitrarily large and hence we are looking for a scalable solution for storage and query.

There is a requirement of 2 range queries

  1. select all docs where last_modified_time >=x and last_modified_time
  2. select all docs where size >= x and size <= y

And also a set of queries where docs needs to be grouped by specific metadata e.g.

  1. select all docs where user in (x,y,z)

What is the best practice of designing the data model based on these queries?

My initial thought is to have a table (in Cassandra 2.0, CQL 3.0) with the last_mod_time as the secondary index as follows

create table t_document ( document_id bigint,
last_mod_time bigint , size bigint, user text, .... primary key (document_id, last_mod_time) }

This should take care of query 1.

Do I need to create another table with the primary key as (document_id, size) for the query 2? Or can I just add the size as the third item in the primary key of the same table e.g. (document_id, last_mod_time, size). But in this case will the second query work without using the last_mod_time in the where clause?

For the query 3, which is all docs for one or more users, is it the best practice to create a t_user_doc table where the primary key is (user, doc_id)? Or a better approach is to create a secondary index on the user on the same t_document table?

Thanks for any help.

ssen
  • 95
  • 2
  • 7

1 Answers1

0

When it comes to inequalities, you don't have many choices in Cassandra. They must be leading clustering columns (or secondary indexes). So a data model might look like this:

CREATE TABLE docs_by_time (
dummy int,
last_modified_time timestamp,
document_id bigint,
size_in_bytes bigint,
PRIMARY KEY ((dummy),last_modified_time,document_id));

The "dummy" column is always set to the same value, and is sued as a placeholder partition key, with all data stored in a single partition.

The drawback to such a data model is that, indeed, all data is stored in a single partition. There is the maximum of 2 billion cells per partition, but more importantly, a single partition never spans nodes. So this approach doesn't scale.

You could create secondary indexes on a table:

CREATE TABLE docs (
document_id bigint,
last_modified_time timestamp,    
size_in_bytes bigint,
PRIMARY KEY ((dummy),last_modified_time,document_id));

CREATE INDEX docs_last_modified on docs(last_modified);

However secondary indexes have important drawbacks (http://www.slideshare.net/edanuff/indexing-in-cassandra), and aren't recommended for data with high cardinality. You could mitigate the cardinality issue somewhat by reducing precision on last_modified_time by, say, only storing the day component.

Marc Fielding
  • 134
  • 2
  • 6