From what I can tell, you seem to be o.k. when it comes to the schema.
If in the future you get messages at a millisecond level, you might
want to partition even lower than the day level you have now.
But day is probably o.k., since sensors rarely send data faster than
every few seconds. I even worked on one project where we partitioned
by month while the data came in every few seconds, and it wasn't that
big of a deal. So from the schema part you are o.k.
Spark jobs also seem to be covered by the schema.
- This is o.k. since you can get all the data for a day without
too much trouble
- I would avoid applying filtering, especially since you have 10 GB per day;
with time this will only get worse. If you provide some details
about why you need the filtering I might be able to help. My honest advice
would be to avoid it altogether.
- This one requires you to iterate over date partitions. My best tip
would be to simply go back into history day by day, with a smart
termination condition (see the sketch after this list). Either make it
fixed for all clients (i.e. don't go further into the past than, let's
say, x months), or make it smarter, i.e. when you go over "all" history
for a client you stop after, let's say, 10 day buckets in a row are empty.
But this might be tricky if some clients have longer outages. Anyway,
you should make this configurable.
- This might be a good answer, but if you are using Spark already this
shouldn't be a problem.
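For reference, each step of that day-by-day walk is just a single-partition
read on the day bucket; the loop and the empty-day counter live in your
Spark/driver code. A minimal sketch against your current cid/event_date
partition key (the client id and dates are placeholders):

SELECT * FROM SensorData
WHERE cid = 'client-1' AND event_date = '2018-06-01';

-- repeat with event_date = '2018-05-31', '2018-05-30', ...
-- and stop once N consecutive days come back empty (N configurable),
-- or once you hit the fixed x-month horizon.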
With Cassandra it's better to simply prepare the data upfront. So your
schema works o.k.: for 1 and 2 you are fine, 3 is also fine, but 4 is
always kind of tricky. By design, if you add 10 GB to a set every day
and you want to process all of it, it will take longer and longer every
day. There isn't really much you can do about that if you want all the data.
Usually in this situation you build some sort of ETL that already computes,
let's say, the sum and average information you need for specific time units.
I.e. if your report is for a whole day, you make new entries in Cassandra
for that day and store the result. That way you won't have to reprocess it
again every single time. So your problem is not multiple smaller tables but
the way you are designing your ETL operations.
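As a sketch, such a pre-aggregated table could look something like this
(table and column names are made up for illustration; one row per client,
device and day with count/sum/average):

CREATE TABLE SensorDataDaily (
cid text,
device_id text,
event_date date,
reading_count bigint,
reading_sum double,
reading_avg double,
PRIMARY KEY ((cid, device_id), event_date)
) WITH CLUSTERING ORDER BY (event_date DESC);

-- the ETL job computes the day's aggregates once and writes a single row:
INSERT INTO SensorDataDaily (cid, device_id, event_date, reading_count, reading_sum, reading_avg)
VALUES ('client-1', 'dev-42', '2018-06-01', 86400, 1857600.0, 21.5);

Reports then read this small table instead of rescanning the raw data every time.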
For backups I would suggest the usual Cassandra workflow. What you provided
in the link works fine; I never had any problems with it. I also wrote
some tools that exported data into CSV, but that was more for other clients
and companies that wanted to do their own processing on the data we had.
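(Not the tooling I used, but if you only need an occasional CSV dump,
cqlsh's built-in COPY TO is enough; keyspace and file names below are
placeholders.)

COPY sensor_ks.SensorData TO 'sensordata_export.csv' WITH HEADER = TRUE;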
Updated answer after additional questions:
Q1: How about having data per day that will be truncated monthly
CREATE TABLE SensorData(
cid text,
event_date date,
event_time timestamp,
data text,
device_id text,
device_type text,
rawdata text,
PRIMARY KEY ((cid, event_date), event_time, device_id, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC)
Q2: Does it make sense to create following table for historical processing:
CREATE TABLE SensorData_YYYYMM (
cid text,
event_date text,
event_time timestamp,
data text,
device_id text,
device_type text,
rawdata text,
PRIMARY KEY ((cid, event_date), event_time, device_id, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC)
The idea itself is not bad, but I have a couple of concerns. The first one
is that you would put all the data for one client-day into a single partition.
Depending on the number of clients you get, this might grow too big.
Usually in IoT you want to keep the data from a single sensor in a single partition,
and you add some time dimension to the partition key. This makes it relatively easy to write ETL jobs.
So basically the key for the first table might be ((cid, device_id, event_date), event_time, device_type), as sketched below.
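For clarity, that is your first table with only the PRIMARY KEY changed
(columns kept as you have them):

CREATE TABLE SensorData (
cid text,
event_date date,
event_time timestamp,
data text,
device_id text,
device_type text,
rawdata text,
PRIMARY KEY ((cid, device_id, event_date), event_time, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC);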
The second one is that if you ever anticipate that two messages from one device
could come into the system within the same millisecond, you might lose data. So I would
advise you to go with the timeuuid type for event_time. Yes, this takes more
space, but you are safe against all the scenarios where you might lose some
readings in the future (when new clients come in, you never know how often
they will send). With timeuuid you are even safe if for some reason a device
aggregates multiple messages to save on bandwidth.
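With timeuuid the only change to the sketch above is the type of event_time;
on insert you use the now() function so that two readings arriving in the
same millisecond still get distinct clustering values. A minimal example
with placeholder values:

-- in the table definition: event_time timeuuid,
INSERT INTO SensorData (cid, device_id, event_date, event_time, device_type, data, rawdata)
VALUES ('client-1', 'dev-42', '2018-06-01', now(), 'temperature', '21.5', '{"t": 21.5}');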
One problem that you would have with the first table in the case I described
is that it might become problematic to know all the device_id values when you
go over it with an ETL. I would suggest having one table that holds the list
of all the device_id values for a single client, and every time you
provision a sensor for a client you also write to this table. Then when
you are doing aggregations, i.e. in Spark, you can easily combine the device_id
with cid and event_date to isolate the partitions that you need. You should
always avoid Spark going over all the entries in the table; that's just too expensive,
and in my opinion this is one way to limit data traversal. This worked really
well on one project I worked on.
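A sketch of such a lookup table (the name is just a suggestion):

CREATE TABLE SensorDevicesByClient (
cid text,
device_id text,
device_type text,
PRIMARY KEY (cid, device_id)
);

-- written once when the sensor is provisioned:
INSERT INTO SensorDevicesByClient (cid, device_id, device_type)
VALUES ('client-1', 'dev-42', 'temperature');

-- the ETL reads this single partition first to learn the device_ids,
-- then hits only the (cid, device_id, event_date) partitions it needs:
SELECT device_id, device_type FROM SensorDevicesByClient WHERE cid = 'client-1';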
I'll now start touching on question 2, but will also refer to question 1.
The thing is that usually I would not recommend keeping the raw data a second
time; it wouldn't make data management much easier. I would recommend just
using the standard Cassandra TTL mechanism: the data simply goes away after
its time expires. From my experience, raw data is rarely needed for a period
longer than a couple of months.
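For example, a table-level default TTL of roughly three months (the value is
only an illustration, and you can also set it per write with USING TTL):

ALTER TABLE SensorData WITH default_time_to_live = 7776000;  -- 90 days, in seconds

-- or per insert (column list matching the timeuuid sketch above):
INSERT INTO SensorData (cid, device_id, event_date, event_time, device_type, data, rawdata)
VALUES ('client-1', 'dev-42', '2018-06-01', now(), 'temperature', '21.5', '{"t": 21.5}')
USING TTL 7776000;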
I.e. on one project we used a relational database to store the data after the ETL
was done, because queries were much simpler to do and there was no learning curve
for the data analysts. We kept the data after the ETL was finished in a so-called
star schema. This worked really well for us.
Basically I would advise you to think about how to aggregate the data and then
make additional tables in Cassandra for reports only. That way you would
save yourself a lot of processing time.
One additional thing that you have to take into account is sensor delay.
Sometimes, due to connectivity issues, sensors will spend even days offline,
so you have to have some sort of policy for handling out-of-order data when it
comes to ETL.
The simplest one is to ignore out-of-order data. Something in between is having a
reasonable delay before you start the ETL jobs, i.e. start processing a couple of hours
after midnight so that you make sure the data is in, and then you do the ETL
for the whole previous day. The most complicated one is updating the already
aggregated data after you find something out of order and then reprocessing,
but I would advise against that.
The bottom line is that I think having an additional month table would not be
beneficial, because it would contain the same data and the access pattern
would not be that different.