I am working on event analytics. I use Hadoop to process the logs and store the results in MySQL. This no longer works due to scalability issues, as new logs keep arriving daily.
We need to show stats per year, month, week, day, and hour, along with filtering capability.
Our data can grow to 100k users, each visiting 20 unique websites per hour:
100,000 (users) * 20 (unique websites) * 2 (locations) * 24 (hours) = 96,000,000 (96 million records per day, max)
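As a sanity check on the scale, the daily figure above (and what it implies per year) works out to:

```python
# Back-of-the-envelope record volume, using the numbers from the question.
users = 100_000
websites_per_user = 20   # unique websites per user, per hour
locations = 2
hours_per_day = 24

records_per_day = users * websites_per_user * locations * hours_per_day
records_per_year = records_per_day * 365

print(records_per_day)   # 96000000  (96 million per day)
print(records_per_year)  # 35040000000 (~35 billion per year)
```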
Our table looks like this:
event_src_id, Time, User, Website, location, some stats
Some example queries are:
1) select website, sum(stats), count(distinct(user_id)) from table group by website;
2) select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time);
3) select website, sum(stats), count(distinct(user_id)) from table where event_src_id = XXXXXXXXXXX group by website;
4) select website, sum(stats), count(distinct(user_id)) from table where Time > '2014-01-01' and Time <= '2014-01-31' group by website;
5) select website, location, sum(stats), count(distinct(user_id)) from table group by website, location;
6) select website, sum(stats) as stats_val from table group by website order by stats_val desc limit 10;
select location, sum(stats) as stats_val from table group by location order by stats_val desc limit 10;
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)
I tried Elasticsearch with Hadoop (es-hadoop), and it seems the insertion side can be fixed with that; I am more worried about the read side. The aggregation framework seems to give some hope, but I could not get it to work for query 1: how do you group, sum, and count distinct at the same time? How can I best use Elasticsearch with Hadoop, given the scalability and performance needs of these OLAP-style queries? Any help will be appreciated.
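For reference, here is the kind of aggregation I would expect query 1 to map to (a sketch only; the field names `website`, `stats`, and `user_id` are assumed to match the indexed columns, and note that Elasticsearch's `cardinality` aggregation is an approximate distinct count, not an exact one):

```python
import json

# Sketch: the SQL "group by website" with SUM(stats) and COUNT(DISTINCT user_id)
# expressed as an Elasticsearch aggregation request body.
agg_body = {
    "size": 0,  # only aggregation results are needed, not individual hits
    "aggs": {
        "per_website": {
            "terms": {"field": "website"},  # ~ GROUP BY website
            "aggs": {
                # ~ SUM(stats)
                "total_stats": {"sum": {"field": "stats"}},
                # ~ COUNT(DISTINCT user_id), approximate by design
                "unique_users": {"cardinality": {"field": "user_id"}},
            },
        }
    },
}

# With the official Python client this would be run roughly as:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   result = es.search(index="events", body=agg_body)   # "events" is a placeholder index name
print(json.dumps(agg_body, indent=2))
```

Queries 2 and 4 would presumably be handled the same way, with a `date_histogram` sub-aggregation or a `range` filter on the time field instead of the plain `terms` grouping.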