
We are using Cassandra for an IoT application. We currently receive about 10 GB of data every day and store all of it in a single Cassandra table using a time-series model. What is the better approach: keeping the data in a single table, or splitting it into multiple tables (yearly/monthly)?

Schema:

CREATE TABLE SensorData (
    cid text,
    event_date date,
    event_time timestamp,
    data text,
    device_id text,
    device_type text,
    rawdata text,
    PRIMARY KEY ((cid, event_date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

List of Spark jobs:

  1. Run client jobs for a single date
  2. Run client jobs for a single date (using ALLOW FILTERING)
  3. Run a specific job over all history for a single client
  4. Run a specific job over all data for all clients

As the data size grows, the jobs get slower. Should we focus on performance tuning (Cassandra/Spark), or should we split the data into several smaller tables?

Backup Strategy

What is the best backup strategy?

  1. The Cassandra way: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_backup_restore_c.html

  2. Exporting to an external format on disk, such as CSV/flat files.

kannadhasan
  • How frequently are you recording data for a single cid? Different versions of C* have a practical max size of a partition. Which version of C* are you running? – bechbd Apr 17 '17 at 15:05
  • We are running Cassandra version 3.2.1. Standard parameters are stored per minute; some critical parameters are stored per second. – kannadhasan Apr 17 '17 at 15:36

1 Answer

From what I can tell, your schema looks o.k. If in the future you get messages at a millisecond rate, you might want to partition at a finer granularity than the per-day level you have now.

But a day is probably fine, since sensors rarely send data faster than once per second. I even worked on one project where we partitioned by month while the data came in every second, and it wasn't much of a problem. So from the schema side you are o.k.
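
Just to illustrate what finer partitioning could look like, here is a rough sketch (the table name and the extra hour bucket column are hypothetical, not something you need today):

-- hypothetical: an extra bucket column keeps partitions bounded if message rates grow
CREATE TABLE SensorDataHourly (
  cid text,
  event_date date,
  hour int,                      -- 0-23 bucket added to the partition key
  event_time timestamp,
  device_id text,
  device_type text,
  data text,
  rawdata text,
  PRIMARY KEY ((cid, event_date, hour), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);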

Spark jobs also seem to be covered by the schema.

  1. This is o.k., since you can fetch all the data for a day without much trouble.
  2. I would avoid ALLOW FILTERING, especially at 10 GB per day; over time it will only get worse. If you provide some details about why you need filtering I might be able to help, but my honest advice is to avoid it altogether.
  3. This one requires you to iterate over date partitions. My best tip is to simply walk back through history day by day, with a sensible termination condition: either fixed for all clients (e.g. never go back further than x months), or smarter, e.g. when walking a client's full history, stop after, say, 10 consecutive empty day buckets. The latter can be tricky if some clients have longer outages, so make it configurable. (See the query sketch after this list.)
  4. This one is trickier, but if you are using Spark already it shouldn't be a problem.
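
For reference, a minimal sketch of the per-day read that job 1 (and each step of the day-by-day walk in 3) issues against your current schema; the client id and date are placeholders:

-- fetch one (cid, event_date) partition; no ALLOW FILTERING needed
SELECT event_time, device_id, device_type, data
FROM SensorData
WHERE cid = 'client-42'          -- placeholder client id
  AND event_date = '2017-04-16'; -- placeholder day; decrement until the termination condition is hit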

With Cassandra it's better to simply prepare the data up front. Your schema works: for 1 and 2 you are fine, and 3 is also fine, but 4 is always somewhat tricky. By design, if you add 10 GB to a data set every day and you want to process all of it, the job will take longer and longer every day. There isn't much you can do about that if you really need all the data.

Usually in this situation you build some sort of ETL that pre-computes, say, the sums and averages you need for specific time units. For example, if your report covers a whole day, you write new entries into Cassandra for that day and store the result, so you never have to reprocess the raw data again. Your problem is therefore not multiple smaller tables but the way you design your ETL operations; a rollup table sketch follows below.
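
As an illustration only (the table name, column names and metrics are hypothetical and depend on what your reports actually need), such a daily rollup table could look like this:

-- hypothetical rollup table written once per day by the ETL job;
-- reports read this instead of re-scanning the raw SensorData partitions
CREATE TABLE SensorDataDaily (
  cid text,
  event_date date,
  device_id text,
  reading_count bigint,
  reading_sum double,
  reading_avg double,
  PRIMARY KEY ((cid, event_date), device_id)
);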

For backups I would suggest the usual Cassandra workflow; what you linked works fine and I have never had any problems with it. I have also written tools that exported data to CSV, but that was more for other clients and companies that wanted to do their own processing on the data we had.

Updated answer after additional questions:

Q1: How about keeping per-day data in a table that is truncated monthly?

CREATE TABLE SensorData(
  cid text,
  event_date date,
  event_time timestamp,
  data text,
  device_id text,
  device_type text,
  rawdata text,
  PRIMARY KEY ((cid, event_date), event_time, device_id, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC);

Q2: Does it make sense to create the following table for historical processing?

CREATE TABLE SensorData_YYYYMM (
  cid text,
  event_date text,
  event_time timestamp,
  data text,
  device_id text,
  device_type text,
  rawdata text,
  PRIMARY KEY ((cid, event_date), event_time, device_id, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC);

The idea itself is not bad, but I have a couple of concerns. The first is that you would put all the data for one client-day into a single partition; depending on how much data a client produces, that can get too big. Usually in IoT you want to keep the data from a single sensor in its own partition, adding some time dimension to the partition key, which makes ETL jobs relatively easy to write. So the key for the first table might be ((cid, device_id, event_date), event_time, device_type); a sketch follows.
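
Spelled out as CQL (the table name is just a placeholder), that would be something like:

-- one partition per client + device + day
CREATE TABLE SensorDataByDevice (
  cid text,
  device_id text,
  event_date date,
  event_time timestamp,
  device_type text,
  data text,
  rawdata text,
  PRIMARY KEY ((cid, device_id, event_date), event_time, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC, device_type ASC);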

The second concern is that if two messages from one device could ever arrive with the same millisecond timestamp, you might lose data. So I would advise you to use the timeuuid type for event_time. Yes, it takes more space, but it protects you against losing readings in the future (when new clients come in, you never know how often they will send). With timeuuid you are even safe if a device aggregates multiple messages to save on bandwidth.
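
For example, assuming event_time were declared as timeuuid in the sketch above (rather than timestamp), time-range queries still work through the built-in minTimeuuid/maxTimeuuid functions:

-- with event_time as timeuuid, read one hour of one client/device/day partition
SELECT data, rawdata
FROM SensorDataByDevice
WHERE cid = 'client-42'
  AND device_id = 'device-7'
  AND event_date = '2017-04-16'
  AND event_time >= minTimeuuid('2017-04-16 10:00+0000')
  AND event_time <  maxTimeuuid('2017-04-16 11:00+0000');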

One problem you would have with the first table in the layout I described is that it might become hard to know all the device_ids your ETL has to iterate over. I would suggest having one table holding the list of all device_ids for a single client, and writing to it every time you provision a sensor for a client. When you do aggregations in Spark, you can then easily combine the device_ids with cid and event_date to isolate exactly the partitions you need. You should always avoid Spark scanning every entry in a table; that is just too expensive, and in my opinion this is one way to limit data traversal. It worked really well on one project I worked on. A sketch of such a lookup table follows.
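
A minimal sketch of that lookup table (all names are placeholders):

-- one row per provisioned device; read once per ETL run to enumerate partitions
CREATE TABLE DevicesByClient (
  cid text,
  device_id text,
  device_type text,
  PRIMARY KEY (cid, device_id)
);

-- written at provisioning time
INSERT INTO DevicesByClient (cid, device_id, device_type)
VALUES ('client-42', 'device-7', 'temperature');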

Now to question 2, which also touches on question 1. I would usually not recommend storing the raw data a second time; it wouldn't make data management much easier. I would recommend simply using Cassandra's standard TTL mechanism, so the raw data goes away once its time expires. In my experience, raw data is rarely needed for longer than a couple of months.
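
For illustration (the 90-day value below is just an example), TTL can be set as a table default or per write:

-- expire raw readings automatically after 90 days (7776000 seconds)
ALTER TABLE SensorData WITH default_time_to_live = 7776000;

-- or per write
INSERT INTO SensorData (cid, event_date, event_time, device_id, device_type, data, rawdata)
VALUES ('client-42', '2017-04-16', '2017-04-16 10:15:00+0000', 'device-7', 'temperature', '...', '...')
USING TTL 7776000;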

For example, on one project we used a relational database to store the data after the ETL was done, because the queries were much simpler and there was no learning curve for the data analysts. We kept the post-ETL data in a so-called star schema, and that worked really well for us.

Basically, I would advise you to think about how to aggregate the data and then create additional tables in Cassandra for reports only. That way you save yourself a lot of processing time.

One additional thing you have to take into account is sensor delay. Due to connectivity issues, sensors can spend days offline, so you need some policy for handling out-of-order data in your ETL.

The simplest policy is to ignore out-of-order data. Something in between is to allow a reasonable delay before starting the ETL jobs, e.g. begin processing a couple of hours after midnight so you can be reasonably sure the data is in, and then run the ETL for the whole previous day. The most complicated option is to update the aggregated ETL data whenever something arrives out of order and reprocess, but I would advise against that.
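
If you ever do reprocess a late day, note that Cassandra writes are upserts, so re-running a day's ETL against a rollup table like the hypothetical SensorDataDaily above simply overwrites that day's row (the values shown are made up):

-- re-running the ETL for a day overwrites the existing rollup row for that key
INSERT INTO SensorDataDaily (cid, event_date, device_id, reading_count, reading_sum, reading_avg)
VALUES ('client-42', '2017-04-16', 'device-7', 86400, 123456.0, 1.43);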

The bottom line is that I don't think an additional monthly table would be beneficial, because it would contain the same data and the access pattern would not be that different.

Marko Švaljek
  • Thanks for the quick response. I would like to describe why we are using filtering: every day the job executes the previous day's calculation. – kannadhasan Apr 17 '17 at 15:44
  • Sure just add the details and I can update the answer – Marko Švaljek Apr 17 '17 at 15:47
  • Data is growing day by day. How can we design and organize the data even better than what we have? My thought is to store the data in two tables. 1. Day-wise data goes into the SensorData table, which is truncated every month (jobs for the current month run against this table): CREATE TABLE SensorData( cid text, event_date date, event_time timestamp, data text, device_id text, device_type text, rawdata text, PRIMARY KEY ((cid, event_date), event_time, device_id, device_type) ) WITH CLUSTERING ORDER BY (event_time DESC) – kannadhasan Apr 17 '17 at 16:28
  • 2. For historical data processing (previous months): CREATE TABLE SensorData_YYYYMM ( cid text, event_date text, event_time timestamp, data text, device_id text, device_type text, rawdata text, PRIMARY KEY ((cid, event_date), event_time, device_id, device_type) ) WITH CLUSTERING ORDER BY (event_time DESC). What are your thoughts? – kannadhasan Apr 17 '17 at 16:28
  • Švaljek, for the schema you propose, ((cid, device_id, event_date), event_time, device_type): one client may have 200 devices. In that case we would need to fire a read query per device (combined partition key cid+device_id+event_date) and merge them all into a single RDD for one day's ETL. That's why we kept device_id outside the partition key. – kannadhasan Apr 18 '17 at 06:20
  • Yes that was what I wrote about. But firing a query before is not that expensive plus it can be done only once per job. I used this many times and it works fine. 200 devices in single bucket I wouldn't recommend, should your project become successful and if you get thousands of devices the approach of putting all client data into single partition will not scale. We also did something similar on an erlier project and we had to break the data. And splitting the data by sensor is kind of natural and it makes your life much easier. – Marko Švaljek Apr 18 '17 at 06:31