
The problem I am trying to address seems trivial. I have a huge collection of events (they actually come from a mobile app, so they are mobile events). Each event is described by several attributes:

 operating_system create_time version resolution model brand network_type etc.

I have those events stored on HDFS, and the problem I am trying to solve is to let users analyse them in near real time. By analysing I mean being able to select only specific columns and an interesting date range, and to see how many events come from different phone models. For example, let's assume I have the following dataset:

 os1 2015-07-30 v1 200x200 model1 brand1 provider1
 os1 2015-07-30 v1 200x200 model1 brand1 provider1
 os1 2015-07-30 v1 200x200 model1 brand1 provider2
 os1 2015-07-30 v1 200x200 model1 brand2 provider2
 os1 2015-07-29 v1 200x200 model1 brand1 provider1
 os2 2015-07-30 v1 200x200 model1 brand1 provider1
 os1 2015-06-30 v1 200x200 model1 brand1 provider1

Let's also assume that the user wants to find the number of events from different phones in July 2015. The answer he is looking for looks as follows (the query is sketched below the listing):

 os1 2015-07-30 v1 200x200 model1 brand1 provider1 4
 os1 2015-07-30 v1 200x200 model1 brand1 provider2 1
 os1 2015-07-30 v1 200x200 model1 brand2 provider2 1
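For concreteness, here is that query sketched in Spark SQL, under the assumption that the raw events are registered as a table named events with the attribute names listed above (the table name is just for illustration):

    import org.apache.spark.sql.SparkSession

    object JulyBreakdown {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("JulyBreakdown").getOrCreate()

        // Count events per device combination for July 2015. The date column
        // drops out of the grouping, which is why the first row above gets 4.
        val result = spark.sql("""
          SELECT operating_system, version, resolution, model, brand, network_type,
                 COUNT(*) AS events
          FROM events
          WHERE create_time >= '2015-07-01' AND create_time < '2015-08-01'
          GROUP BY operating_system, version, resolution, model, brand, network_type
        """)
        result.show()
      }
    }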

Because the number of events is huge, I tried to pre-calculate aggregates and store them in Cassandra. Aggregates were calculated per day; given the previous example dataset, my aggregates would look like this (the aggregation job is sketched below the listing):

 os1 2015-06-30 v1 200x200 model1 brand1 provider1 1
 os1 2015-07-29 v1 200x200 model1 brand1 provider1 1
 os1 2015-07-30 v1 200x200 model1 brand1 provider1 3
 os1 2015-07-30 v1 200x200 model1 brand1 provider2 1
 os1 2015-07-30 v1 200x200 model1 brand2 provider2 1
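The daily pre-aggregation is roughly this (a sketch, assuming the raw events sit on HDFS as CSV with a header row; the paths are illustrative, and the real job writes to Cassandra rather than Parquet):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, to_date}

    object DailyAggregates {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DailyAggregates").getOrCreate()

        val events = spark.read
          .option("header", "true")
          .csv("hdfs:///events/") // illustrative path

        // One row per (day, dimension combination), with the event count.
        val daily = events
          .withColumn("day", to_date(col("create_time")))
          .groupBy("operating_system", "day", "version", "resolution",
                   "model", "brand", "network_type")
          .agg(count("*").as("events"))

        daily.write.mode("overwrite").parquet("hdfs:///aggregates/daily/")
      }
    }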

The problem is that there are still too many of them. I still need Spark to run an on-demand task that sums the aggregates over the requested date range; it is slow and requires a lot of network transfer (this step is sketched below). I have read a lot about HyperLogLog and other similar algorithms, but I don't see how I can use them here. I don't really care about exact results; estimations are good enough for me. Can anyone suggest what I can do?
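For reference, the slow on-demand step looks roughly like this (again a sketch: it reads the daily aggregates back, filters the requested range, and sums the per-day counts; the real job reads from Cassandra, not Parquet):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object OnDemandSum {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("OnDemandSum").getOrCreate()

        val daily = spark.read.parquet("hdfs:///aggregates/daily/") // illustrative path

        // Sum the per-day counts for the requested range. Every daily row in
        // the range has to be moved and merged, hence the network transfer.
        val result = daily
          .filter("day >= '2015-07-01' AND day < '2015-08-01'")
          .groupBy("operating_system", "version", "resolution",
                   "model", "brand", "network_type")
          .agg(sum("events").as("events"))

        result.show()
      }
    }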

homar

1 Answer


Add an additional field to your data. This field will break your data down into smaller chunks (we call this binning the data). For example, give every 1000 records a single bin, then do the aggregation inside each bin, like:

1 os1 2015-06-30 v1 200x200 model1 brand1 provider1 1
1 os1 2015-07-29 v1 200x200 model1 brand1 provider1 1
1 os1 2015-07-30 v1 200x200 model1 brand1 provider1 3
.
.
2 os1 2015-07-30 v1 200x200 model1 brand1 provider2 1
2 os1 2015-07-30 v1 200x200 model1 brand2 provider2 1
.

This will reduce your shuffling a lot and give you an approximate result. For the full result, do an extra step that aggregates the results from the bins.
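A minimal sketch of the idea in Spark (Scala), assuming the column names from the question; assigning bins via monotonically_increasing_id is just one possible way to give every ~1000 consecutive records a bin:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, floor, monotonically_increasing_id, sum}

    object BinnedAggregation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("BinnedAggregation").getOrCreate()

        val events = spark.read.option("header", "true").csv("hdfs:///events/") // illustrative path
        val binSize = 1000
        val dims = Seq("operating_system", "create_time", "version", "resolution",
                       "model", "brand", "network_type")

        // Step 1: tag every ~1000 consecutive records with a bin id and
        // aggregate inside each bin. Reading only a sample of the bins already
        // gives a usable estimate of the distribution.
        val perBin = events
          .withColumn("bin", floor(monotonically_increasing_id() / binSize))
          .groupBy(("bin" +: dims).map(col): _*)
          .agg(count("*").as("events"))

        // Step 2 (the extra step mentioned above): merge the per-bin partial
        // counts to get the exact result.
        val full = perBin
          .groupBy(dims.map(col): _*)
          .agg(sum("events").as("events"))

        full.show()
      }
    }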

Abdulrahman
  • Think of it as a kind of directing the parallelism: when you have similar data (the same bin) on a single machine, you reduce shuffling. @homar – Abdulrahman Aug 01 '15 at 01:26