The problem I am trying to solve seems trivial. I have huge collections of events (they come from a mobile app, so they are mobile events). Each event is described by several attributes:
operating_system create_time version resolution model brand network_type etc.
I have these events stored on HDFS, and the problem I am trying to solve is letting users analyse them in near real time. By analysing I mean being able to select only specific columns and a date range of interest, and to see how many events come from different phone models. For example, let's assume I have the following dataset:
os1 2015-07-30 v1 200x200 model1 brand1 provider1
os1 2015-07-30 v1 200x200 model1 brand1 provider1
os1 2015-07-30 v1 200x200 model1 brand1 provider2
os1 2015-07-30 v1 200x200 model1 brand2 provider2
os1 2015-07-29 v1 200x200 model1 brand1 provider1
os2 2015-07-30 v1 200x200 model1 brand1 provider1
os1 2015-06-30 v1 200x200 model1 brand1 provider1
Let's also assume that a user wants to find the number of events from different phones in July 2015. The answer the user is looking for looks as follows:
os1 2015-07-30 v1 200x200 model1 brand1 provider1 4
os1 2015-07-30 v1 200x200 model1 brand1 provider2 1
os1 2015-07-30 v1 200x200 model1 brand2 provider2 1
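To make the query concrete, here is a minimal sketch in plain Python of what I want to compute. It assumes the grouping is on model, brand and network_type only (ignoring operating system and exact day), since that is the grouping that produces the counts 4, 1, 1 above. The names `EVENTS` and `count_events` are made up for illustration; the real data lives on HDFS, not in a list.

```python
from collections import Counter
from datetime import date

# Sample events from above:
# (operating_system, create_time, version, resolution, model, brand, network_type)
EVENTS = [
    ("os1", date(2015, 7, 30), "v1", "200x200", "model1", "brand1", "provider1"),
    ("os1", date(2015, 7, 30), "v1", "200x200", "model1", "brand1", "provider1"),
    ("os1", date(2015, 7, 30), "v1", "200x200", "model1", "brand1", "provider2"),
    ("os1", date(2015, 7, 30), "v1", "200x200", "model1", "brand2", "provider2"),
    ("os1", date(2015, 7, 29), "v1", "200x200", "model1", "brand1", "provider1"),
    ("os2", date(2015, 7, 30), "v1", "200x200", "model1", "brand1", "provider1"),
    ("os1", date(2015, 6, 30), "v1", "200x200", "model1", "brand1", "provider1"),
]


def count_events(events, start, end):
    """Count events with start <= create_time <= end per (model, brand, network_type)."""
    counts = Counter()
    for _os, created, _version, _resolution, model, brand, network in events:
        if start <= created <= end:
            counts[(model, brand, network)] += 1
    return counts


# Hypothetical user request: all of July 2015.
july = count_events(EVENTS, date(2015, 7, 1), date(2015, 7, 31))
for key, n in sorted(july.items()):
    print(*key, n)
```

On the sample dataset this gives 4 for model1/brand1/provider1 and 1 for each of the other two combinations.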
Because the number of events is huge, I tried to calculate aggregates and store them in Cassandra. The aggregates were calculated per day; given the previous example dataset, my aggregates would look like this:
os1 2015-06-30 v1 200x200 model1 brand1 provider1 1
os1 2015-07-29 v1 200x200 model1 brand1 provider1 1
os1 2015-07-30 v1 200x200 model1 brand1 provider1 3
os1 2015-07-30 v1 200x200 model1 brand1 provider2 1
os1 2015-07-30 v1 200x200 model1 brand2 provider2 1
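Roughly, the pre-aggregation and the later sum over a date range work like the sketch below. This is plain Python rather than my actual job, and it assumes the aggregate key is (day, model, brand, network_type), which is what reproduces the five rows above. The names `daily_aggregates` and `sum_range` are hypothetical.

```python
from collections import Counter
from datetime import date

# The sample dataset, one event per line.
RAW = """\
os1 2015-07-30 v1 200x200 model1 brand1 provider1
os1 2015-07-30 v1 200x200 model1 brand1 provider1
os1 2015-07-30 v1 200x200 model1 brand1 provider2
os1 2015-07-30 v1 200x200 model1 brand2 provider2
os1 2015-07-29 v1 200x200 model1 brand1 provider1
os2 2015-07-30 v1 200x200 model1 brand1 provider1
os1 2015-06-30 v1 200x200 model1 brand1 provider1
"""


def daily_aggregates(raw):
    """Count events per (day, model, brand, network_type)."""
    aggregates = Counter()
    for line in raw.splitlines():
        _os, created, _version, _resolution, model, brand, network = line.split()
        aggregates[(date.fromisoformat(created), model, brand, network)] += 1
    return aggregates


def sum_range(aggregates, start, end):
    """Sum the daily aggregates whose day falls within [start, end]."""
    totals = Counter()
    for (day, model, brand, network), n in aggregates.items():
        if start <= day <= end:
            totals[(model, brand, network)] += n
    return totals


AGGREGATES = daily_aggregates(RAW)
JULY = sum_range(AGGREGATES, date(2015, 7, 1), date(2015, 7, 31))
```

The point of the sketch is that `sum_range` still has to touch one row per day per attribute combination, so the work grows with both the length of the date range and the number of distinct combinations.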
The problem is that there are still too many of them. I still need Spark to run an on-demand task to sum the aggregates from the requested date range. It is slow and requires a lot of network transfer. I have read a lot about HyperLogLog and other similar algorithms, but I don't see how I can use them here. I don't really need exact results; estimates are good enough for me. Can anyone suggest what I can do?