
The Cassandra database is not very good at aggregation, which is why I decided to aggregate before writing. I am storing some data (e.g. transactions) for each user, aggregated by hour. That means there is only one row per user for each hour.

Whenever I receive new data, I read the row for the current hour, aggregate it with the received data, and write it back. I use this data to generate hourly reports.
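The read-aggregate-write cycle described above can be sketched in pure Python, with a dict standing in for the Cassandra table (all names and the row layout are illustrative, not from a real schema):

```python
from datetime import datetime, timezone

def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (the time part of the row key)."""
    return ts.replace(minute=0, second=0, microsecond=0)

def read_aggregate_write(table: dict, user_id: str, ts: datetime, amount: float) -> None:
    """Read the current hour's row, merge the new record into it, write it back."""
    key = (user_id, hour_bucket(ts))
    row = table.get(key, {"count": 0, "total": 0.0})                    # read
    row = {"count": row["count"] + 1, "total": row["total"] + amount}   # aggregate
    table[key] = row                                                    # write back

# Two records in the same hour end up merged into one row:
table = {}
t = datetime(2017, 1, 4, 15, 25, tzinfo=timezone.utc)
read_aggregate_write(table, "u1", t, 10.0)
read_aggregate_write(table, "u1", t, 5.0)
```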

This works fine with low-velocity data, but I observed considerable data loss when the velocity is very high (e.g. 100 records for one user in a minute). Reads and writes happen so fast that, because of the delayed write, my read does not see the latest data.
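The data loss here is the classic lost-update race: two writers read the same stale row, each merges its own record, and the second write overwrites the first. A sequential simulation (dict in place of Cassandra) makes the effect visible:

```python
def lost_update_demo() -> int:
    """Simulate two clients doing read-aggregate-write on the same row."""
    table = {"count": 0}
    snapshot_a = table["count"]      # client A reads 0
    snapshot_b = table["count"]      # client B also reads 0 (A's write not visible yet)
    table["count"] = snapshot_a + 1  # A writes 1
    table["count"] = snapshot_b + 1  # B also writes 1 -- A's update is lost
    return table["count"]            # 1, although 2 records arrived
```

At 100 records per user per minute, many such overlapping read/write pairs occur, so aggregates silently undercount.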

I think my "aggregate before write" approach itself may be wrong. I was thinking about a UDF, but I am not sure how it would impact performance.

What is the best way to store aggregated data in Cassandra?

RhishiM88
  • What kind of aggregates and how much would be aggregated when reading. How often is it reading? Have you considered aggregating them in batches like Spark Streaming? – Chris Lohfink Jan 04 '17 at 15:25
  • In my test environment I am writing 200 records in a few seconds (I wrote a script which sends data continuously). It reads the row before every insert. I run this script every 5 minutes, but in the production environment I am expecting a higher velocity than this; in the worst case there could be 1000 records coming in at the same time. – RhishiM88 Jan 07 '17 at 13:57
  • "Best Practices" questions are off-topic: **Opinion Based** and **Too Broad**. Please read [What types of questions should I avoid asking?](http://stackoverflow.com/help/dont-ask) before attempting to ask more questions. –  Mar 20 '17 at 17:39

2 Answers


My idea would be:

  • Model data in Cassandra in hour-by-hour buckets.
  • Store the plain data in Cassandra immediately when it arrives.
  • At hour X, process all the data from hour X-1 and store the aggregated result in another table.

This would allow you to sustain very fast incoming rates, process each record only once, and store the aggregates in another table for fast reads.
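The batch step above can be sketched as follows, assuming raw records are stored as (user, timestamp, amount) tuples and the job runs shortly after the hour rolls over (the record shape and row layout are assumptions for illustration):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def aggregate_previous_hour(raw_records, now: datetime) -> dict:
    """Aggregate all raw records from the hour before `now` into
    (user, hour) -> {count, total} rows destined for the aggregate table."""
    prev_hour = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    out = defaultdict(lambda: {"count": 0, "total": 0.0})
    for user, ts, amount in raw_records:
        if prev_hour <= ts < prev_hour + timedelta(hours=1):
            row = out[(user, prev_hour)]
            row["count"] += 1
            row["total"] += amount
    return dict(out)
```

Because raw writes are append-only (no read-modify-write on the hot path), concurrent clients cannot overwrite each other, and the aggregation itself touches each hour's data exactly once.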

xmas79

I also use Cassandra to pre-aggregate. I have different tables for hourly, daily, weekly, and monthly data. I think you are getting data loss because you are selecting the data before your last inserts have replicated to the other nodes.

Look into the counter data type to get around this.
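Counters sidestep the race because each write is a commutative increment applied on the server, not an overwrite of a value the client read earlier. The difference can be simulated in plain Python (a dict stands in for a counter table; the key format is made up for the example):

```python
def counter_increment(table: dict, key, delta: int) -> None:
    """Apply a server-side style increment: the client sends only a delta,
    so a stale read can never cause a lost update (this mirrors the
    semantics of Cassandra's counter type, simulated here in-process)."""
    table[key] = table.get(key, 0) + delta

# Two "concurrent" clients each record one event; both increments survive,
# unlike the read-modify-write pattern where one would be overwritten:
table = {}
counter_increment(table, ("u1", "2017-01-04T15:00"), 1)
counter_increment(table, ("u1", "2017-01-04T15:00"), 1)
```

Note that counters only support integer increments/decrements, so they fit counts and sums of whole units, not arbitrary merged payloads.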

You may also be able to specify a higher consistency level on either the inserts or the selects to ensure you're reading the most recent data.

Jim Wright
  • I will also have different tables for hourly, daily, weekly, and monthly reports. The problem is that I have 10 different types of reports, and each type will have hourly, daily, weekly, and monthly filters. Would it be wise to create so many tables? – RhishiM88 Jan 07 '17 at 14:03
  • 4 denormalised tables per report doesn't sound too bad to me. If you were on Cassandra 3.0+ you could look at materialized views. – Jim Wright Jan 09 '17 at 09:15