I am trying to use Cassandra to store data from some sensors. I read lots of articles about time series data models for Cassandra. I started with Getting Started with Time Series Data Modeling, and "Time Series Pattern 2" looked like the best way to go. So I created a keyspace with a replication factor of 2 and a table like this:
CREATE TABLE sensors_radio.draw (
    dvid uuid,
    bucket_time date,
    utc_time double,
    fft_size int,
    n_avg int,
    n_blocks int,
    power double,
    sample_rate double,
    start_freq double,
    PRIMARY KEY ((dvid, bucket_time), utc_time)
);
where dvid is a unique device id, bucket_time is a day (e.g. 2017-08-30) and utc_time is a timestamp.
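With this model, the application has to derive the list of (dvid, bucket_time) partition keys from the query's time range before it can issue the SELECT. A minimal sketch of that derivation, assuming utc_time is UTC epoch seconds (the helper name is mine, not part of the original schema):

```python
from datetime import datetime, timedelta, timezone

def day_buckets(start_ts, end_ts):
    """Return the day buckets (as 'YYYY-MM-DD' strings) spanned by a
    [start_ts, end_ts) range of UTC epoch seconds."""
    start = datetime.fromtimestamp(start_ts, tz=timezone.utc).date()
    end = datetime.fromtimestamp(end_ts, tz=timezone.utc).date()
    buckets = []
    d = start
    while d <= end:
        buckets.append(d.isoformat())
        d += timedelta(days=1)
    return buckets

# The range used in the query below spans two day buckets:
print(day_buckets(1.4988002e9, 1.4988734e9))  # ['2017-06-30', '2017-07-01']
```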
My query is:
SELECT utc_time,start_freq,sample_rate,fft_size,n_avg,n_blocks,power
FROM sensors_radio.draw
WHERE dvid=<dvid>
AND bucket_time IN (<list-of-days>)
AND utc_time>=1.4988002E9
AND utc_time<1.4988734E9;
As you can see, I need to retrieve data from multiple days, which means reading multiple partitions across my cluster. Query performance looks poor to me, which is understandable given the IN anti-pattern.
EDIT: I tried to avoid the IN anti-pattern by splitting my query into multiple single-partition queries, but I got no performance improvement.
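For reference, the usual way to avoid IN is to issue one query per partition and merge the results client-side, ideally running the queries concurrently. A sketch that just builds the per-partition statements (the helper name and parameter style are my assumptions; the statement text mirrors the query above):

```python
def per_partition_queries(dvid, buckets, t_start, t_end):
    """Build one (cql, params) pair per day bucket, replacing the single
    multi-partition IN query with independent single-partition queries."""
    cql = ("SELECT utc_time,start_freq,sample_rate,fft_size,n_avg,n_blocks,power "
           "FROM sensors_radio.draw "
           "WHERE dvid=%s AND bucket_time=%s AND utc_time>=%s AND utc_time<%s")
    return [(cql, (dvid, day, t_start, t_end)) for day in buckets]

queries = per_partition_queries("a-device-uuid",
                                ["2017-06-30", "2017-07-01"],
                                1.4988002e9, 1.4988734e9)

# With the DataStax Python driver these could then be run concurrently, e.g.:
#   futures = [session.execute_async(cql, params) for cql, params in queries]
#   rows = [row for f in futures for row in f.result()]
```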
I thought about increasing my partition size by using a month instead of a day as bucket_time, so that my query would hit a single partition.
But I fear the partition would grow too large! From the answers to this question, I figured that in a month my partition would have roughly 500 million cells (so well below the 2 billion limit), but it would certainly exceed the 100 MB size guideline and the 100,000-rows guideline.
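The cell-count estimate can be sanity-checked with simple arithmetic. In the sketch below the insert rate is an assumption chosen to reproduce the question's ballpark figure, not a measured value, and the bytes-per-cell factor is a very rough guess:

```python
# Rough month-sized-partition estimate.
# ASSUMPTION: ~32 rows inserted per second per device (picked to land
# near the "roughly 500 million cells" figure above).
rows_per_sec = 32
seconds_per_month = 30 * 24 * 3600
regular_columns = 6  # fft_size, n_avg, n_blocks, power, sample_rate, start_freq

rows_per_month = rows_per_sec * seconds_per_month
cells_per_month = rows_per_month * regular_columns

# ASSUMPTION: ~10 bytes per cell on disk after compression.
approx_bytes = cells_per_month * 10

print(rows_per_month)      # ~83 million rows
print(cells_per_month)     # ~500 million cells, below the 2 billion limit
print(approx_bytes / 1e9)  # ~5 GB, far above the ~100 MB guideline
```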
What is the recommended data model in this scenario? Are large disk size partitions a problem?
Thanks in advance.
P.S. I am using Cassandra 3.10 on a cluster of 3 nodes (8 cores, 16 GB RAM each).