I'm new to Cassandra and am trying to create a schema for storing and querying a time series of mixed / heterogeneous data. I have questions on how to model the timestamps and the mixed data.
My problem can be abstracted like this:
- I have on the order of 100 million deployed sensor pods.
- Each pod contains a different mix of sensors, where each sensor generates either
double
orstring
readings. - On each day, a pod generates a list of its sensor readings (e.g. sensor1, sensor2, sensor5).
For each sensor pod, I would like to store its sensors' readings for one day. So I am planning to store data like this:
- podID : Unique ID for the pod
- dt: represents one entire day, such as July 10, 2018
- num_data: map that stores numeric sensor readings, e.g. { "sensor1" : -123, "sensor3" : 123 }
- str_data: map that stores string sensor readings, e.g. { "sensor2" : "foo", "sensor5" : "bar"}
QUESTION 1: What's the best way to store the date? I searched online and found multiple ways.
The first way is use a timestamp as the clustering key:
CREATE TABLE time_series (
podID text,
dt timestamp,
num_data map<text, double>,
str_data map<text, double>,
PRIMARY KEY (podID, dt)
);
The second way I've found is to store the date as individual integers for year, month, and day.
CREATE TABLE time_series (
podID text,
year int,
month int,
day int,
num_data map<text, double>,
str_data map<text, double>,
PRIMARY KEY (podID, (year, month, day))
);
A third way puts a part of the timestamp into the partition key.
PRIMARY KEY ((podID, month), dt))
How do I choose which way to go?
QUESTION 2: Is my use of map<>
data structures the best way to store mixed / varying / heterogeneous types of data?