
I am trying to use Cassandra to store data from some sensors. I read lots of articles about time series data model for Cassandra. I started from Getting Started with Time Series Data Modeling and "Time Series Pattern 2" looked like the best way to go. So I created a keyspace with a replication factor of 2 and a table like this

CREATE TABLE sensors_radio.draw (
    dvid uuid,
    bucket_time date,
    utc_time double,
    fft_size int,
    n_avg int,
    n_blocks int,
    power double,
    sample_rate double,
    start_freq double,
    PRIMARY KEY ((dvid, bucket_time), utc_time)
);

where dvid is a unique device id, bucket_time is a day (e.g. 2017-08-30) and utc_time is a timestamp.

My query is

SELECT utc_time,start_freq,sample_rate,fft_size,n_avg,n_blocks,power
FROM sensors_radio.draw 
WHERE dvid=<dvid> 
AND bucket_time IN (<list-of-days>) 
AND utc_time>=1.4988002E9 
AND utc_time<1.4988734E9;

As you can see, I need to retrieve data from multiple days, which means reading multiple partitions across my cluster. Query performance looks poor to me, which is understandable given the IN anti-pattern.

EDIT: I tried to avoid the IN anti-pattern by splitting my query into multiple ones, but I get no performance improvement.

I thought about increasing my partition size by using a month instead of a day as bucket_time, so that my query would hit a single partition. But I fear the partition would grow too much! By reading the answers to this question, I figured that in a month my partition would have roughly 500 million cells (so way less than the 2 billion limit), but of course it would exceed the 100 MB size limit and the 100,000-rows limit.

What is the recommended data model in this scenario? Are large disk size partitions a problem?

Thanks in advance.

P.S. I am using Cassandra 3.10 on a cluster of 3 nodes (8 cores, 16 GB RAM)

  • IN is in fact an anti-pattern, as multi-partition queries are often 'too slow'. Also, large partitions on disk cause other issues with compaction and read performance. – Mandraenke Aug 30 '17 at 12:48
  • I highly recommend moving from 3.2 to 3.11; pre-3.9 in particular had a lot of issues. – Chris Lohfink Aug 30 '17 at 13:17
  • Sorry, my mistake. I am using Cassandra 3.10. Do I lose my data if I update to 3.11? – Matteo Pergolesi Aug 30 '17 at 13:21
  • Upgrading Cassandra is pretty safe - keep an eye on your configs: re-add all your changes to the new cassandra.yaml, and look into `nodetool upgradesstables` (see https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsUpgradeSstables.html) – Mandraenke Aug 31 '17 at 10:07

2 Answers


As you said, queries using IN can be quite slow because multiple partitions need to be read, yet your whole query is handled by a single coordinator node (which, when possible, is chosen as a node that owns one of the partitions).

Also, large partitions have been a nightmare in the past - in 3.6 and onwards they should not be as bad (see https://de.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016). Read performance and memory pressure have been serious issues.

What worked really well for me - though it depends on your use case - is to keep the buckets 'small enough' (a day), issue the 31 queries for a month asynchronously and in parallel, and join the results back in your code. Futures support this in Java, for example. This way each query hits only a single bucket/partition, and most likely all nodes in your cluster process your queries in parallel.
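The per-bucket fan-out described above can be sketched as follows. This is a minimal illustration, not driver code: `fetch_bucket` is a hypothetical stand-in for a single-partition query (e.g. via a Cassandra driver's async execute), stubbed out here so the fan-out and the joining of results stay visible.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def fetch_bucket(day):
    # Stand-in for one single-partition query, e.g.:
    #   SELECT ... FROM sensors_radio.draw WHERE dvid = ? AND bucket_time = ?
    return [(day, "row")]  # pretend each day-bucket returns one row

# One query per day-bucket for August 2017
days = [date(2017, 8, 1) + timedelta(days=i) for i in range(31)]

# Issue all bucket queries in parallel and collect the per-bucket results
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_bucket, days))

# Join the buckets back together in client code
rows = [r for bucket in results for r in bucket]
print(len(rows))  # 31
```

With a real driver, each `fetch_bucket` call would hit exactly one partition, so the coordinator-bottleneck of the IN query disappears.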

Mandraenke
  • Hi @Mandraenke, thanks for your answer. You are right about splitting your query into multiple queries. This is what I've already done after reading [this](https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/), but I have no performance improvement (exactly same execution time). I did not mention it because my question would get too complex. I am going to update it now. – Matteo Pergolesi Aug 30 '17 at 13:13
  • 1
    Did you execute them async? – Mandraenke Aug 30 '17 at 13:14
  • Yes, as stated in the article in the link. – Matteo Pergolesi Aug 30 '17 at 13:18
  • I tried this: I executed a query from Java split into 4, async, with futures. It takes ~40s to retrieve all the 2812033 rows. What do you think about that time? Is it reasonable? – Matteo Pergolesi Aug 30 '17 at 14:15
  • That's about 70k rows per second - whether that's OK or not depends on row size. Monitor your network bandwidth and load on your cluster nodes, and bandwidth on the server issuing the queries. To point it out - are there still `IN` clauses in your 4 queries? – Mandraenke Aug 30 '17 at 14:25
  • No, the 4 queries only contain `=` on the partition key fields. I am monitoring bandwidth with `iftop`. I see transmission peaks of 40Mbps (link is 100Mbps between client and server). Core utilization never goes over 30%. – Matteo Pergolesi Aug 30 '17 at 14:37
  • Hey, sorry for being late but I was busy trying a lot of things with Cassandra and later I was distracted by other things at work. I tried a new data model with smaller partitions, but there was no improvement. One thing that boosted up performance was to switch the CPU governor of our servers from `ondemand` to `performance`. I guess it's something. – Matteo Pergolesi Sep 11 '17 at 08:38

Actually, you have misunderstood how the number of values (cells) is counted.

The ~2 billion limit is not about the number of rows - it counts cells, which are determined by the regular columns and clustering keys.

Use this formula for that Nv=Nr(Nc−Npk−Ns)+Ns

The number of values (or cells) in the partition (Nv) is equal to the number of static columns (Ns) plus the product of the number of rows (Nr) and the number of values per row. The number of values per row is defined as the number of columns (Nc) minus the number of primary key columns (Npk) and static columns (Ns).

In short: the number of rows multiplied by the number of regular (non-key) columns.

In your case it will be:

(500 000 000 * (9 - 3 - 0) + 0) = 3 000 000 000

So you exceed the 2 billion limit.
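The cell-count arithmetic above can be sketched as a quick sanity check (the 500 million rows per month and the 9/3/0 column counts come from the question's table):

```python
def partition_cells(n_rows, n_cols, n_pk, n_static):
    """Nv = Nr * (Nc - Npk - Ns) + Ns"""
    return n_rows * (n_cols - n_pk - n_static) + n_static

# sensors_radio.draw: 9 columns total, 3 primary key columns, 0 static columns
cells = partition_cells(500_000_000, 9, 3, 0)
print(cells)  # 3000000000 - over the ~2 billion cell limit
```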

And here is the formula for Calculating Size on Disk:

And your partition size on disk will be huge:

(20 + 0 + (500000000 * 84) + (8 * 3000000000)) =
66000000020 bytes (62942.50 MB)

Obviously far more than Cassandra's recommended 100 MB partition size.
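The disk-size arithmetic can be checked the same way. The term values below (20 bytes for the partition key columns, 84 bytes of data per row, 8 bytes of metadata per cell) are the estimates used in the formula above:

```python
pk_size = 20             # size of partition key columns, bytes
static_size = 0          # no static columns
rows = 500_000_000       # rows per monthly partition (from the question)
row_size = 84            # estimated data bytes per row
cells = 3_000_000_000    # Nv from the cell-count formula
cell_overhead = 8        # metadata bytes per cell

total = pk_size + static_size + rows * row_size + cell_overhead * cells
print(total)  # 66000000020 bytes, i.e. roughly 62942 MB - far above the ~100 MB guideline
```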

I calculated it using my open source project - cql-calculator.

Scaramouche
Oleksandr Mosur