I'm aggregating 4 dimensions on a Hive table (tens of billions of rows), and they need to be rolled up or cubed. Say my table is about user interactions, and I aggregate on which mobile application they use, which mobile OS they have, and so on. The SQL looks like this:
select d1, d2, d3, d4, count(distinct userId) as uv, sum(1) as pv from T
group by cube(d1, d2, d3, d4)
Since the current implementation of grouping sets / rollup / cube explodes the input (according to this JIRA), the query effectively makes 2^4 = 16 passes over the input in total, which is quite resource-consuming.
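For reference, cube(d1, d2, d3, d4) enumerates every subset of the four dimensions, so the query above is equivalent to the following 16 grouping sets (written in Hive's GROUPING SETS syntax), which is where the 16 passes come from:
select d1, d2, d3, d4, count(distinct userId) as uv, sum(1) as pv from T
group by d1, d2, d3, d4
grouping sets (
    (d1, d2, d3, d4),
    (d1, d2, d3), (d1, d2, d4), (d1, d3, d4), (d2, d3, d4),
    (d1, d2), (d1, d3), (d1, d4), (d2, d3), (d2, d4), (d3, d4),
    (d1), (d2), (d3), (d4),
    ( )
)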
My custom solution is to first aggregate on userId and the 4 dimensions into a temporary table, and then cube over that:
create table tmp as
select userId, d1, d2, d3, d4, sum(1) as pv from T
group by userId, d1, d2, d3, d4
then
select d1, d2, d3, d4, count(distinct userId) as uv, sum(pv) as pv
from tmp
group by cube(d1, d2, d3, d4)
Note that the distinct on userId is still needed in the second query: at the coarser grouping sets, a user who appears under several dimension combinations has several rows in tmp and would otherwise be counted more than once. The distinct is much cheaper now, though, because tmp is far smaller than T.
Since the input data is huge, this pre-aggregation reduces the runtime significantly (it shrinks the input that the 16 passes have to scan).
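A quick way to see how much the pre-aggregation helps is to compare the row counts the cube actually has to scan; this is just an illustrative sanity check, not part of the pipeline:
select 'T' as src, count(1) as n from T
union all
select 'tmp' as src, count(1) as n from tmp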
So I wonder: could Hive/Spark have this rewrite built in as an optimizer rule, or is there a standard way people deal with this problem that I'm not aware of?
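For completeness, here is the same rewrite expressed as a single statement with a CTE (Spark SQL syntax; Hive would spell the cube as group by d1, d2, d3, d4 with cube). This is essentially the plan I'd hope an optimizer rule could derive automatically:
with pre as (
    select userId, d1, d2, d3, d4, sum(1) as pv
    from T
    group by userId, d1, d2, d3, d4
)
select d1, d2, d3, d4, count(distinct userId) as uv, sum(pv) as pv
from pre
group by cube(d1, d2, d3, d4)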