
I want to add a third dimension as a criterion on queries in Cassandra. It already allows efficient 2-D queries because it is not simply a key-value store, but actually a key-key-value store. That is:

Simple key-value store:

enter image description here

Key-key-value store:

enter image description here

So the attraction with Cassandra is that given a value for keyA, I can perform very efficient range queries along keyB, because they are contiguously stored.

Now is it possible, given keyA and keyB, to also have an index along a third dimension, say keyC, so that I can limit which values are returned based on keyC?

So essentially:

enter image description here

Basically given keyA, say keyA-1, and a range of KeyB, say keyB-2 thru keyB-4, I want only to return the values corresponding with keyC-3, shown green above.

I know this is possible because even a simple key-value store can do it with multiple indices. The question is whether it is efficient: could I still perform really fast range queries along keyB?

My use case is time series, where I want to store minute-resolution and daily-resolution data for the same series. So keyA would be the series I want, keyB would be the day, and keyC would be the minute. I want to do this because if everything were stored only at minute resolution, fetching daily data would mean pulling far too much data over the network and into memory (24*60 minutes per day, of which I only want one), followed by lots of client-side aggregation.

I know I could store minute and daily in separate tables, but that would limit my flexibility somewhat, not to mention the cleanliness of the schema.

If this is not easy/efficient in Cassandra, is it possible in Riak TS?

Thomas Browne

1 Answer


Basically given keyA, say keyA-1, and a range of KeyB, say keyB-2 thru keyB-4, I want only to return the values corresponding with keyC-3, shown green above.

Yes, it is possible with the following table structure:

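CREATE TABLE data (
     keyA text,    -- series ID (partition key)
     keyC text,    -- minute (first clustering column, matched by equality)
     keyB int,     -- day (second clustering column, range-scanned)
     val double,
     PRIMARY KEY ((keyA), keyC, keyB)
);

-- Equality on keyA and keyC, range on keyB; aaa and bbb are placeholder bounds
SELECT * FROM data WHERE keyA='xxx' AND keyC='yyy' AND keyB>=aaa AND keyB<=bbb;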

The abstraction for this table can be seen as:

Map<KeyA,SortedMap<KeyC,SortedMap<KeyB,val>>>  
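To make the nested-map picture concrete, here is a minimal in-memory sketch of it in Python. It is only a toy model of the abstraction, not how Cassandra is implemented; the class name `SeriesStore`, its methods, and the sample keys are hypothetical names chosen for illustration. It shows why fixing keyA and keyC first turns a range over keyB into a single contiguous slice.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class SeriesStore:
    """Toy model of PRIMARY KEY ((keyA), keyC, keyB); not Cassandra itself."""

    def __init__(self):
        # keyA -> keyC -> (sorted list of keyB values, parallel list of vals)
        self._data = defaultdict(lambda: defaultdict(lambda: ([], [])))

    def insert(self, key_a, key_c, key_b, val):
        keys, vals = self._data[key_a][key_c]
        i = bisect_left(keys, key_b)
        if i < len(keys) and keys[i] == key_b:
            vals[i] = val  # same primary key: overwrite, like an upsert
        else:
            keys.insert(i, key_b)
            vals.insert(i, val)

    def range_query(self, key_a, key_c, b_min, b_max):
        """Return (keyB, val) pairs with b_min <= keyB <= b_max."""
        keys, vals = self._data[key_a][key_c]
        lo = bisect_left(keys, b_min)   # first keyB >= b_min
        hi = bisect_right(keys, b_max)  # one past the last keyB <= b_max
        return list(zip(keys[lo:hi], vals[lo:hi]))  # one contiguous slice


store = SeriesStore()
for day in range(1, 6):
    store.insert("keyA-1", "keyC-3", day, float(day))
    store.insert("keyA-1", "keyC-7", day, -1.0)  # other minute, never touched below

result = store.range_query("keyA-1", "keyC-3", 2, 4)
print(result)
```

The rows for `keyC-7` are never visited by the query, which is the point of placing keyC ahead of keyB in the clustering order.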

So keyA would be the series I want, keyB would be the day, and keyC would be the minute

So essentially, with the above table, you can answer the query "give me all values for series S (keyA), for minute M (keyC), and for days (keyB) between X and Y" very efficiently, because it results in a sequential scan of contiguous rows.

The only problem now is that the partition, whose key is based only on the series ID (keyA), will grow arbitrarily large.

One solution is to split it by year, e.g. by using a composite partition key such as PRIMARY KEY((keyA, year), keyC, keyB). This imposes an extra constraint on your queries: you must provide both the series ID and the year every time.
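The effect of the year bucket on reads can be sketched with a small Python model. This is an illustration under stated assumptions, not Cassandra code: it assumes keyB is a day encoded as the integer YYYYMMDD (the original table only says keyB is an int), and the names `partitions`, `year_of`, `insert`, and `range_query` are hypothetical. It shows that a day range crossing a year boundary has to touch one partition per year.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# (keyA, year) -> keyC -> (sorted list of keyB values, parallel list of vals)
partitions = defaultdict(lambda: defaultdict(lambda: ([], [])))


def year_of(key_b):
    # ASSUMPTION: keyB is a day encoded as the integer YYYYMMDD
    return key_b // 10000


def insert(key_a, key_c, key_b, val):
    # No upsert handling in this sketch: each keyB is assumed to be inserted once
    keys, vals = partitions[(key_a, year_of(key_b))][key_c]
    i = bisect_left(keys, key_b)
    keys.insert(i, key_b)
    vals.insert(i, val)


def range_query(key_a, key_c, b_min, b_max):
    """One partition read per year covered by [b_min, b_max]."""
    rows = []
    for year in range(year_of(b_min), year_of(b_max) + 1):
        keys, vals = partitions[(key_a, year)][key_c]
        lo, hi = bisect_left(keys, b_min), bisect_right(keys, b_max)
        rows.extend(zip(keys[lo:hi], vals[lo:hi]))
    return rows


insert("s1", "0930", 20151230, 1.0)
insert("s1", "0930", 20151231, 2.0)
insert("s1", "0930", 20160101, 3.0)
insert("s1", "0930", 20160102, 4.0)

rows = range_query("s1", "0930", 20151231, 20160101)
print(rows)
```

Each partition stays bounded to one year of one series, at the cost of the client knowing which years a query spans.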

doanduyhai