Hive SQL aggregate merge multiple sqls into one

Question

I have a serial sqls like:

select count(distinct userId) from table where hour >= 0 and hour <= 0;
select count(distinct userId) from table where hour >= 0 and hour <= 1;
select count(distinct userId) from table where hour >= 0 and hour <= 2;
...
select count(distinct userId) from table where hour >= 0 and hour <= 14;

Is there a way to merge them into one sql?

Try to use case when operation – Strick Nov 07 '19 at 14:33 — Strick, Nov 07 '19 at 14:33

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

It looks like you are trying to keep a cumulative count, bracketed by the hour. To do that, you can use a window function, like this:

SELECT DISTINCT
  A.hour AS hour,
  SUM(COALESCE(M.include, 0)) OVER (ORDER BY A.hour) AS cumulative_count
FROM ( -- get all records, with 0 for include
  SELECT
    name,
    hour,
    0 AS include
  FROM
    table
  ) A
  LEFT JOIN
  ( -- get the record with lowest `hour` for each `name`, and 1 for include
    SELECT
      name,
      MIN(hour) AS hour,
      1 AS include
    FROM 
      table
    GROUP BY
      name
  ) M
  ON  M.name = A.name
  AND M.hour = A.hour
;

There might be a simpler way, but this should yield the correct answer in general.

Explanation:

This uses 2 subqueries against the same input table, with a derived field called include to keep track of which records should contribute to the final total for each bucket. The first subquery simply takes all records in the table and assigns 0 AS include. The second subquery finds all unique names and the lowest hour slot in which that name appears, and assigns them 1 AS include. The 2 subqueries are LEFT JOIN'ed by the enclosing query.

The outermost query does a COALESCE(M.include, 0) to fill in any NULL's produced by the LEFT JOIN, and those 1's and 0's are SUM'ed and windowed by hour. This needs to be a SELECT DISTINCT rather than using a GROUP BY becuse a GROUP BY will want both hour and include listed, but it ends up collapsing every record in a given hour group into a single row (still with include=1). The DISTINCT is applied after the SUM so it will remove duplicates without discarding any input rows.

Also distinct count is not additive. So distinct count for hour between 0 and 1 <> distinct count for hour 0 + distinct count for hour=1 — leftjoin, Nov 07 '19 at 09:33
It should be distinct userid, not distinct hour+count. What if the same user is twice in different hours — leftjoin, Nov 11 '19 at 13:36
@leftjoin I updated this answer. This produces exactly the result I believe OP is after, and I think it takes care of your concerns as well. — Z4-tier, Nov 15 '19 at 14:12

Hive SQL aggregate merge multiple sqls into one

1 Answers1

Explanation: