I am building a database, with the following characteristics:
- Schemaless database with a variable number of columns for each row.
- Tens of millions of records and tens of columns.
- Millions of queries per day.
- Thousands of writes per day.
- Queries will be filtering on several columns (not only the key).
I am considering Cassandra, which is built to scale.
My questions are:
- Do I need to scale horizontally in this case?
- Does Cassandra support having several keys to point to the same column-family?
EDIT
I would like to make sure I understood your point correctly, so the following example writes out what I took from your answer.
Suppose we have the following column family (it holds some store products and their details):
products // column-family name
{
x = { "id":"x", // this is unique id for the row.
"name":"Laptop",
"screen":"15 inch",
"OS":"Windows"}
y = { "id":"y", // this is unique id for the row.
"name":"Laptop",
"screen":"17 inch"}
z = { "id":"z", // this is unique id for the row.
"name":"Printer",
"page per minute":"20 pages"}
}
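To make the problem concrete, here is a minimal in-memory sketch (plain Python dicts, not actual Cassandra) of the column family above. It shows why filtering on a non-key column is expensive without an index: every row has to be scanned.

```python
# Hypothetical in-memory model of the schemaless "products" column family:
# each row is a dict, and rows may have a different set of columns.
products = {
    "x": {"id": "x", "name": "Laptop", "screen": "15 inch", "OS": "Windows"},
    "y": {"id": "y", "name": "Laptop", "screen": "17 inch"},
    "z": {"id": "z", "name": "Printer", "page per minute": "20 pages"},
}

def filter_by(column, value):
    """Filtering on a non-key column requires scanning every row."""
    return [row for row in products.values() if row.get(column) == value]

print([r["id"] for r in filter_by("name", "Laptop")])  # ['x', 'y']
```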
Now, if we want to add a "name" search parameter, we create another copy of the CF with different row keys, as follows:
products
{
"x:name:Laptop" = { "id":"x",
"name":"Laptop",
"screen":"15 inch",
"OS":"Windows"}
"y:name:Laptop" = { "id":"y",
"name":"Laptop",
"screen":"17 inch"}
"z:name:Printer" = { "id":"z",
"name":"Printer",
"ppm":"20 pages"}
}
And similarly, to add the "screen" search parameter:
products
{
"x:screen:15 inch" = { "id":"x"
"name":"Laptop",
"screen":"15 inch",
"OS":"Windows"}
"y:screen:17 inch" = { "id":"y",
"name":"Laptop",
"screen":"17 inch"}
}
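The maintenance burden of this pattern can be sketched as follows: a minimal Python sketch (names like `insert` and `index_cfs` are my own, purely illustrative), assuming the application must write one extra copy per indexed column present in the row.

```python
# Hypothetical sketch: every write to the base CF also writes a copy keyed by
# "row:column:value" for each indexed column, mimicking the copies above.
indexed_columns = ["name", "screen"]
base = {}       # row_key -> row
index_cfs = {}  # composite key "row:column:value" -> row

def insert(row_key, row):
    base[row_key] = row
    for col in indexed_columns:
        if col in row:  # rows missing a column get no entry for it
            index_cfs[f"{row_key}:{col}:{row[col]}"] = row

insert("x", {"id": "x", "name": "Laptop", "screen": "15 inch", "OS": "Windows"})
insert("z", {"id": "z", "name": "Printer", "page per minute": "20 pages"})
# "z" has no screen column, so only "z:name:Printer" is created.
```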
But if we want to query on any combination of 10 search parameters (as is the case in my application), then we would have to create 1023 copies of the column family [(2 to the power 10) - 1]. And since most rows will carry many of the search parameters, this means we need roughly 1000 times extra storage to model the data this way, which is not negligible, especially if we have 10,000,000 rows in the original CF.
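The 1023 figure above checks out: the number of non-empty subsets of 10 parameters equals the sum of combinations C(10, k) for k from 1 to 10.

```python
from math import comb

# Number of non-empty subsets of 10 search parameters:
# one copy of the CF per combination of columns we might filter on.
n_copies = sum(comb(10, k) for k in range(1, 11))
print(n_copies)  # 1023, i.e. 2**10 - 1
```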
Is this the data model you suggested?
Another point: I don't quite see why creating secondary indexes would forfeit the schemaless model.
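My intuition here, as a minimal sketch (plain Python dicts, not Cassandra's actual index implementation): a secondary index is just an auxiliary map from a column value back to row keys, so rows that lack the indexed column simply get no index entry, and the rows themselves stay schemaless.

```python
# Hypothetical sketch: an index on "screen" as a value -> row-keys map.
rows = {
    "x": {"name": "Laptop", "screen": "15 inch"},
    "z": {"name": "Printer"},  # no "screen" column at all; still a valid row
}

screen_index = {}  # column value -> set of row keys
for key, row in rows.items():
    if "screen" in row:
        screen_index.setdefault(row["screen"], set()).add(key)

print(screen_index)  # {'15 inch': {'x'}} -- "z" is simply absent from the index
```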