
I have a program that does some fairly intensive calculations, and I would like to cache the results of those calculations in a Cassandra table. What is the best schema to use to do so?

Currently, I'm using the following schema:

CREATE TABLE raw_data_cache (
    id uuid,
    buckets int,
    start_time timestamp,
    end_time timestamp,
    time timestamp,
    data list<float>,
    PRIMARY KEY(id,buckets,start_time,end_time,time)
) WITH gc_grace_seconds = 1;

id is the ID of the data source; buckets, start_time, and end_time are processing parameters; time is the unique key for each entry; and data holds the time series values.

To insert data into the table I use a standard INSERT along with a TTL:

INSERT INTO raw_data_cache (id,buckets,start_time,end_time,time,data) VALUES
(?,?,?,?,?,?) USING TTL 360;
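
For reference, the whole cache entry for a given set of parameters is read back at once, with a query along these lines:

SELECT time, data FROM raw_data_cache
WHERE id = ? AND buckets = ? AND start_time = ? AND end_time = ?;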

The problem with this schema is that I eventually get consistent read timeouts, and I think it's due to the number of tombstones: Read 0 live and 3777400 tombstoned cells (taken from "tracing on" in cqlsh).

I can get rid of all of them by using nodetool, but I don't want to have to do that every few minutes. Is there any better schema or usage that will improve this case?
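
By "using nodetool" I mean forcing a major compaction, roughly like this (the keyspace name is a placeholder):

nodetool compact my_keyspace raw_data_cache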

Edit: raw_data_cache is a table for storing a processed version of raw_data. I've followed what seems to be conventional wisdom in storing raw_data, with the exception of using a list<float> (that's because I have several different inputs per timestamp, and I want to get all of them at once). Here's the base time series table:

CREATE TABLE raw_data(
   id uuid,
   time timestamp,
   data list<float>,
   PRIMARY KEY (id, time)
);
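
For context, reads against raw_data are range slices over time within a single partition, something like (the bounds here are just an example):

SELECT time, data FROM raw_data
WHERE id = ? AND time >= '2013-10-27 00:00:00' AND time < '2013-10-28 00:00:00';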

My goal with raw_data_cache is to save a smaller, processed version of raw_data for a few hours.

srlm

1 Answer


I don't think your data model is really optimized for this usage. You should use a more time-series-based approach, with columns for each time period you want to cache. I'm not 100% sure, but I think gc_grace_seconds=1 is probably not what you really want.

This is one of the best resources for Cassandra data modeling: http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling. There are also three videos on the topic by the same author.

If you want to optimize for getting the newest items first, you can do something like:

CREATE TABLE raw_data(
   id uuid,
   time timestamp,
   data list<float>,
   PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);
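
With this ordering, fetching the newest points requires no ORDER BY clause; for example (the LIMIT is arbitrary):

SELECT time, data FROM raw_data WHERE id = ? LIMIT 100;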

This puts the most recent events first, which is helpful in a cache. If you want buckets based on hours, you can use the same trick shown in that article: a text date that includes the hour, such as '2013-10-27 12', so that everything from that hour lands in the same bucket. So maybe you could try something like:

CREATE TABLE summarized_data_cache(
    id uuid,
    time_bucket text,
    time timestamp,
    data list<float>,
    PRIMARY KEY ((id, time_bucket), time)
);

This would be fast to write, but also fast to retrieve, because everything would be stored in one wide row.
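
For example, reading one hour's bucket back is then a single-partition query (the bucket value shown is just illustrative):

SELECT time, data FROM summarized_data_cache
WHERE id = ? AND time_bucket = '2013-10-27 12';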

Peter Halliday
  • I've added more information to the question about the context. raw_data_cache (or equivalent) needs to store the processed version of raw_data, which is modeled according to the link you've shared. Thanks! – srlm Oct 27 '13 at 04:52
  • Hmmm. Thanks for the (( ), ) primary key structure. That seems to shave off 20% or so of the read time. As for my particular case, the division is arbitrary (no hour, minute, or second boundaries), so I can't pre-bucket it. And the entire cache (for a particular primary key) is always read at once. – srlm Oct 27 '13 at 19:28