0

I'm new to Cassandra and am trying to create a schema for storing and querying a time series of mixed / heterogeneous data. I have questions on how to model the timestamps and the mixed data.

My problem can be abstracted like this:

  • I have on the order of 100 million deployed sensor pods.
  • Each pod contains a different mix of sensors, where each sensor generates either double or string readings.
  • On each day, a pod generates a list of its sensor readings (e.g. sensor1, sensor2, sensor5).

For each sensor pod, I would like to store its sensors' readings for one day. So I am planning to store data like this:

  1. podID : Unique ID for the pod
  2. dt: represents one entire day, such as July 10, 2018
  3. num_data: map that stores numeric sensor readings, e.g. { "sensor1" : -123, "sensor3" : 123 }
  4. str_data: map that stores string sensor readings, e.g. { "sensor2" : "foo", "sensor5" : "bar"}

QUESTION 1: What's the best way to store the date? I searched online and found multiple ways.

The first way is use a timestamp as the clustering key:

CREATE TABLE time_series (
    podID     text,
    dt        timestamp,
    num_data  map<text, double>,
    str_data  map<text, double>,
    PRIMARY KEY (podID, dt)
);

The second way I've found is to store the date as individual integers for year, month, and day.

CREATE TABLE time_series (
    podID     text,
    year      int,
    month     int,
    day       int,
    num_data  map<text, double>,
    str_data  map<text, double>,
    PRIMARY KEY (podID, (year, month, day))
);

A third way puts a part of the timestamp into the partition key.

PRIMARY KEY ((podID, month), dt))

How do I choose which way to go?

QUESTION 2: Is my use of map<> data structures the best way to store mixed / varying / heterogeneous types of data?

stackoverflowuser2010
  • 38,621
  • 48
  • 169
  • 217
  • 1. your primary key should depends on what all query you need to execute (specifically what would be your where clause?) and what result you expect (what type of sorting you need) keeping in mind that you dont have wide row. 2. yes you can you map to keep mixed data (eg. json ) – Laxmikant Jul 12 '18 at 02:52
  • if data produced by a pod within a month is more that ~100 mb then your 3rd option will also cause wide row..so may be you need to partition the data weekly..On the other hand if a pod generats only few mb of data in entire life the your first option is also ok where you can query the data based on pod id and sorted in order of time. – Laxmikant Jul 12 '18 at 03:02

0 Answers0