I am building a database, with the following characteristics:
- Schemaless database with a variable number of columns for each row.
- Tens of millions of records and tens of columns.
- Millions of queries per day.
- Thousands of writes per day.
- Queries will be filtering on several columns (not only the key).
I am considering Cassandra, which is built to scale.
My questions are:
- Do I need to scale horizontally in this case?
- Does Cassandra support having several keys to point to the same column-family?
EDIT
I would like to make sure I understood your point correctly, so the following example writes out what I took from your answer.
Suppose we have the following column family (it holds some store products and their details):
products // column-family name
{
x = { "id":"x", // this is unique id for the row.
"name":"Laptop",
"screen":"15 inch",
"OS":"Windows"}
y = { "id":"y", // this is unique id for the row.
"name":"Laptop",
"screen":"17 inch"}
z = { "id":"z", // this is unique id for the row.
"name":"Printer",
"page per minute":"20 pages"}
}
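To make the problem concrete, here is a minimal in-memory sketch (plain Python dicts, not actual Cassandra) of the column family above. It shows why filtering on a non-key column is expensive without an index: every row has to be scanned.

```python
# Hypothetical in-memory model of the schemaless "products" column family:
# each row is a dict, and rows may have a different set of columns.
products = {
    "x": {"id": "x", "name": "Laptop", "screen": "15 inch", "OS": "Windows"},
    "y": {"id": "y", "name": "Laptop", "screen": "17 inch"},
    "z": {"id": "z", "name": "Printer", "page per minute": "20 pages"},
}

def filter_by(column, value):
    """Filtering on a non-key column requires scanning every row."""
    return [row for row in products.values() if row.get(column) == value]

print([r["id"] for r in filter_by("name", "Laptop")])  # ['x', 'y']
```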
Now, if we want to add a "name" search parameter, we create another copy of the CF with different row keys, as follows:
products
{
"x:name:Laptop" = { "id":"x",
"name":"Laptop",
"screen":"15 inch",
"OS":"Windows"}
"y:name:Laptop" = { "id":"y",
"name":"Laptop",
"screen":"17 inch"}
"z:name:Printer" = { "id":"z",
"name":"Printer",
"ppm":"20 pages"}
}
And similarly, to add the "screen" search parameter:
products
{
"x:screen:15 inch" = { "id":"x"
"name":"Laptop",
"screen":"15 inch",
"OS":"Windows"}
"y:screen:17 inch" = { "id":"y",
"name":"Laptop",
"screen":"17 inch"}
}
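The maintenance burden of this pattern can be sketched as follows: a minimal Python sketch (names like `insert` and `index_cfs` are my own, purely illustrative), assuming the application must write one extra copy per indexed column present in the row.

```python
# Hypothetical sketch: every write to the base CF also writes a copy keyed by
# "row:column:value" for each indexed column, mimicking the copies above.
indexed_columns = ["name", "screen"]
base = {}       # row_key -> row
index_cfs = {}  # composite key "row:column:value" -> row

def insert(row_key, row):
    base[row_key] = row
    for col in indexed_columns:
        if col in row:  # rows missing a column get no entry for it
            index_cfs[f"{row_key}:{col}:{row[col]}"] = row

insert("x", {"id": "x", "name": "Laptop", "screen": "15 inch", "OS": "Windows"})
insert("z", {"id": "z", "name": "Printer", "page per minute": "20 pages"})
# "z" has no screen column, so only "z:name:Printer" is created.
```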
But if we want to query on any combination of 10 search parameters (as is the case in my application), then we would have to create 1023 copies of the column family [(2 to the power 10) - 1]. And since most rows will carry many of the search parameters, this means we need roughly 1000 times extra storage to model the data this way, which is not negligible, especially if we have 10,000,000 rows in the original CF.
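The 1023 figure above checks out: the number of non-empty subsets of 10 parameters equals the sum of combinations C(10, k) for k from 1 to 10.

```python
from math import comb

# Number of non-empty subsets of 10 search parameters:
# one copy of the CF per combination of columns we might filter on.
n_copies = sum(comb(10, k) for k in range(1, 11))
print(n_copies)  # 1023, i.e. 2**10 - 1
```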
Is this the data model you suggested?
Another point: I don't quite see why creating secondary indexes would forfeit the schemaless model.
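My intuition here, as a minimal sketch (plain Python dicts, not Cassandra's actual index implementation): a secondary index is just an auxiliary map from a column value back to row keys, so rows that lack the indexed column simply get no index entry, and the rows themselves stay schemaless.

```python
# Hypothetical sketch: an index on "screen" as a value -> row-keys map.
rows = {
    "x": {"name": "Laptop", "screen": "15 inch"},
    "z": {"name": "Printer"},  # no "screen" column at all; still a valid row
}

screen_index = {}  # column value -> set of row keys
for key, row in rows.items():
    if "screen" in row:
        screen_index.setdefault(row["screen"], set()).add(key)

print(screen_index)  # {'15 inch': {'x'}} -- "z" is simply absent from the index
```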