
I have a program that does some fairly intensive calculations, and I would like to cache the results of those calculations in a Cassandra table. What is the best schema to use to do so?

Currently, I'm using the following schema:

CREATE TABLE raw_data_cache (
    id uuid,
    buckets int,
    start_time timestamp,
    end_time timestamp,
    time timestamp,
    data list<float>,
    PRIMARY KEY(id,buckets,start_time,end_time,time)
) WITH gc_grace_seconds = 1;

id is the ID of the data source; buckets, start_time, and end_time are processing parameters; time is the unique key for each entry; and data holds the time series values.

To insert data into the table I use a standard INSERT along with a TTL:

INSERT INTO raw_data_cache (id,buckets,start_time,end_time,time,data) VALUES
(?,?,?,?,?,?) USING TTL 360;
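
For reference, the whole cache entry for a given set of parameters is read back at once, with a query along these lines:

SELECT time, data FROM raw_data_cache
WHERE id = ? AND buckets = ? AND start_time = ? AND end_time = ?;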

The problem with this schema is that I eventually get consistent read timeouts, and I think it's due to the number of tombstones: Read 0 live and 3777400 tombstoned cells (taken from "tracing on" in cqlsh).

I can get rid of all of them by using nodetool, but I don't want to have to do that every few minutes. Is there any better schema or usage that will improve this case?
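
By "using nodetool" I mean forcing a major compaction, roughly like this (the keyspace name is a placeholder):

nodetool compact my_keyspace raw_data_cache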

Edit: raw_data_cache is a table for storing a processed version of raw_data. I've followed what seems to be conventional wisdom in storing raw_data, with the exception of using a list<float> (that's because I have several different inputs per timestamp, and I want to get all of them at once). Here's the base time series table:

CREATE TABLE raw_data(
   id uuid,
   time timestamp,
   data list<float>,
   PRIMARY KEY (id, time)
);
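
For context, reads against raw_data are range slices over time within a single partition, something like (the bounds here are just an example):

SELECT time, data FROM raw_data
WHERE id = ? AND time >= '2013-10-27 00:00:00' AND time < '2013-10-28 00:00:00';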

My goal with raw_data_cache is to save a smaller, processed version of raw_data for a few hours.

srlm

1 Answer


I don't think your data model is really optimized for this usage. You should use a more time-series-based approach, with columns for each time period you want to cache. I'm not 100% sure, but I think gc_grace_seconds=1 is probably not what you really want.

This is one of the best resources for Cassandra data modeling: http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling. There are also three videos on the topic by the same author.

If you want to optimize for getting the newest items first, you can do something like:

CREATE TABLE raw_data(
   id uuid,
   time timestamp,
   data list<float>,
   PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);
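
With this ordering, fetching the newest points requires no ORDER BY clause; for example (the LIMIT is arbitrary):

SELECT time, data FROM raw_data WHERE id = ? LIMIT 100;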

This puts the most recent events first, which is helpful in a cache. If you want buckets based on hours, you can use the same trick shown in that article: a text date that includes the hour, such as '2013-10-27 12', so that everything from that hour lands in the same bucket. So maybe you could try something like:

CREATE TABLE summarized_data_cache(
    id uuid,
    time_bucket text,
    time timestamp,
    data list<float>,
    PRIMARY KEY ((id, time_bucket), time)
);

This would be fast to write, but also fast to retrieve, because everything would be stored in one wide row.
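
For example, reading one hour's bucket back is then a single-partition query (the bucket value shown is just illustrative):

SELECT time, data FROM summarized_data_cache
WHERE id = ? AND time_bucket = '2013-10-27 12';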

Peter Halliday
  • I've added more information to the question about the context. raw_data_cache (or equivalent) needs to store the processed version of raw_data, which is modeled according to the link you've shared. Thanks! – srlm Oct 27 '13 at 04:52
  • Hmmm. Thanks for the (( ), ) primary key structure. That seems to shave off 20% or so of the read time. As for my particular case, the division is arbitrary (no hour, minute, or second boundaries), so I can't pre-bucket it. And the entire cache (for a particular primary key) is always read at once. – srlm Oct 27 '13 at 19:28