I have a program that does some fairly intensive calculations, and I would like to cache the results of those calculations in a Cassandra table. What is the best schema to use to do so?
Currently, I'm using the following schema:
CREATE TABLE raw_data_cache (
    id uuid,
    buckets int,
    start_time timestamp,
    end_time timestamp,
    time timestamp,
    data list<float>,
    PRIMARY KEY (id, buckets, start_time, end_time, time)
) WITH gc_grace_seconds = 1;
id is the id of the data source; buckets, start_time, and end_time are processing parameters; time is the unique "key" per column; and data holds the time-series values.
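For context, a typical read against this cache pulls back every cached point for one combination of those parameters, roughly like this (the literal values are just placeholders for illustration):

SELECT time, data
FROM raw_data_cache
WHERE id = 756716f7-2e54-4715-9f00-91dcbea6cf50
  AND buckets = 100
  AND start_time = '2014-01-01 00:00:00+0000'
  AND end_time = '2014-01-02 00:00:00+0000';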
To insert data into the table I use a standard INSERT with a TTL:
INSERT INTO raw_data_cache (id, buckets, start_time, end_time, time, data)
VALUES (?, ?, ?, ?, ?, ?) USING TTL 360;
The problem with this schema is that I eventually get consistent read timeouts, and I think it's due to the number of tombstones:

Read 0 live and 3777400 tombstoned cells

(taken from "tracing on" in cqlsh).
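For completeness, that output comes from a cqlsh session along these lines, with tracing enabled before running the same kind of read as above (values are placeholders):

TRACING ON;
SELECT time, data FROM raw_data_cache
WHERE id = 756716f7-2e54-4715-9f00-91dcbea6cf50 AND buckets = 100
  AND start_time = '2014-01-01 00:00:00+0000' AND end_time = '2014-01-02 00:00:00+0000';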
I can get rid of all of them by using nodetool, but I don't want to have to do that every few minutes. Is there a better schema or usage pattern that will improve this situation?
Edit:

raw_data_cache is a table for storing a processed version of raw_data. I've followed what seems to be conventional wisdom for storing raw_data, with the exception of the list<float> (that's because I have several different inputs per timestamp, and I want to fetch all of them at once). Here's the base time-series table:
CREATE TABLE raw_data (
    id uuid,
    time timestamp,
    data list<float>,
    PRIMARY KEY (id, time)
);
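With that layout, a typical slice query against raw_data would look something like this (placeholder values again, just to illustrate the access pattern):

SELECT time, data
FROM raw_data
WHERE id = 756716f7-2e54-4715-9f00-91dcbea6cf50
  AND time >= '2014-01-01 00:00:00+0000'
  AND time < '2014-01-01 01:00:00+0000';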
My goal with raw_data_cache is to save a smaller, processed version of raw_data for a few hours.