Efficient Multiple Group-bys

Question

I have the following table:

Year	Week	Day_1	Day_2	Day_3
2020	1	Walk	Jump	Swim
2020	3	Walk	Swim	Walk
2020	1	Jump	Walk	Swim

I want to group by YEAR, WEEK and Event (Walk, jump, Swim) and count the number of times each event occurs in Day_1, Day_2, Day_3. I.e.

Year	Week	Event	Count_Day_1	Count_Day_2	Count_Day_3
2020	1	Walk	1	1	0
2020	3	Walk	1	0	1
2020	1	Jump	1	1	0
2020	3	Jump	0	0	0
2020	1	Swim	0	0	2
2020	3	Swim	. 0	1	0

How can I do this efficiently?

As a possible solution: [this answer on UNPIVOT](https://stackoverflow.com/a/27832362/2778710) accompanied with conditional aggregation may be helpful. — astentx, Feb 05 '21 at 11:33

score 4 · Accepted Answer · answered Feb 05 '21 at 11:43

In BigQuery, I would unpivot using arrays and then aggregate:

with t as (
       select 2020 as year, 1 as week, 'Walk' as day_1, 'Jump' as day_2, 'Swim' as day_3 union all
       select 2020, 3, 'Walk', 'Swim', 'Walk' union all
       select 2020, 1, 'Jump', 'Walk', 'Swim'
      )
select t.year, t.week, s.event,
       countif(day = 1) as day_1, countif(day = 2) as day_2, countif(day = 3) as day_3
from t cross join
     unnest([struct(t.day_1 as event, 1 as day),
             struct(t.day_2 as event, 2 as day),
             struct(t.day_3 as event, 3 as day)
            ]) s
group by t.year, t.week, s.event;

score 1 · Answer 2 · answered Feb 06 '21 at 01:12

Consider this less verbose option

select year, week, event, 
  countif(offset = 0) as day_1, 
  countif(offset = 1) as day_2, 
  countif(offset = 2) as day_3
from `project.dataset.table`,
unnest([day_1, day_2, day_3]) event with offset
where not event is null
group by year, week, event

If applied to sample data in your question - output is

score 0 · Answer 3 · edited Feb 05 '21 at 12:04

0

You need to find distinct event, do cross join with your table and use conditional aggregation as follows:

select t.year, t.week, e.event, 
       count(case when t.day_1 = e.event then 1 end) as count_day_1,
       count(case when t.day_2 = e.event then 1 end) as count_day_2,
       count(case when t.day_3 = e.event then 1 end) as count_day_3 
  from your_Table t
  cross join (select distinct day_1 as event from your_table
              union all select day_2 from your_table
              union all select day_3 from your_table) e
group by t.year, t.week, e.event

edited Feb 05 '21 at 12:04

astentx

6,393
2
16
25

answered Feb 05 '21 at 11:25

Popeye

35,427
4
10
31

There's no `event` column in the source table, so there should be unpivoting logic inside `cross join` – astentx Feb 05 '21 at 11:34
Right @astentx. Updated the query. Please check – Popeye Feb 05 '21 at 11:36
1

I've edited, should work: 1) `e.day_N` should be `t.day_N`; 2) BigQuery [throws an error](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#union_example) for `UNION` and asks to do `select distinct ... union all`. I'ts interesting! – astentx Feb 05 '21 at 11:51

score 0 · Answer 4 · answered Feb 05 '21 at 11:35

Demo code is MS SQL!

If you want to generate a full grid for every week and every year for every event then there are two pre-aggregates required, one for event and another one for every year and week.

Like:

DECLARE
  @OriginalData
  TABLE
  (
    numYear   smallint,
    numWeek   tinyint,
    dscDay1   nvarchar(20),
    dscDay2   nvarchar(20),
    dscDay3   nvarchar(20)
  )
;

INSERT INTO
  @OriginalData
(
  numYear, numWeek, dscDay1, dscDay2, dscDay3
)
VALUES
  ( 2020, 1, N'Walk', N'Jump', N'Swim' ),
  ( 2020, 3, N'Walk', N'Swim', N'Walk' ),
  ( 2020, 1, N'Jump', N'Walk', N'Swim' )
;

SELECT
  numYear, numWeek, dscDay1, dscDay2, dscDay3
FROM
  @OriginalData
;

WITH
  cteNormalise
(
  dscActivity
)
AS
(
  SELECT
    dscDay1
  FROM
    @OriginalData
  GROUP BY
    dscDay1
  UNION
  SELECT
    dscDay2
  FROM
    @OriginalData
  GROUP BY
    dscDay2
  UNION
  SELECT
    dscDay3
  FROM 
    @OriginalData
  GROUP BY
    dscDay3
),
  cteGrid
(
  numYear,
  numWeek
)
AS
(
  SELECT
    numYear,
    numWeek
  FROM
    @OriginalData
  GROUP BY
    numYear,
    numWeek
)
SELECT
  --/* Debug output */ *
  YearWeek.numYear,
  YearWeek.numWeek,
  Normalised.dscActivity,
  Count( Day1.dscDay1 ) AS CountDay1,
  Count( Day2.dscDay2 ) AS CountDay2,
  Count( Day3.dscDay3 ) AS CountDay3
FROM
  cteNormalise AS Normalised
  CROSS JOIN cteGrid AS YearWeek
  LEFT OUTER JOIN @OriginalData AS Day1
    ON  Day1.dscDay1 = Normalised.dscActivity
    AND Day1.numYear = YearWeek.numYear
    AND Day1.numWeek = YearWeek.numWeek
  LEFT OUTER JOIN @OriginalData AS Day2
    ON  Day2.dscDay2 = Normalised.dscActivity
    AND Day2.numYear = YearWeek.numYear
    AND Day2.numWeek = YearWeek.numWeek
  LEFT OUTER JOIN @OriginalData AS Day3
    ON  Day3.dscDay3 = Normalised.dscActivity
    AND Day3.numYear = YearWeek.numYear
    AND Day3.numWeek = YearWeek.numWeek
GROUP BY
  YearWeek.numYear,
  YearWeek.numWeek,
  Normalised.dscActivity
ORDER BY
  YearWeek.numYear,
  Normalised.dscActivity,
  YearWeek.numWeek
;

This will work, however efficiency is questionable due to the steps to normalise the data before the actual aggregation happens.

If possible I suggest converting the table first into a 3NF with just key columns of Year, Week, Event and Day. Then a fairly efficient summary can be produced. At the cost of the normalisation beforehand. Otherwise the cost of transformation is required in the query.

It is hard to port this T-SQL stuff with insert into table variable to other DBMSes. You can use `with` clause for this to make it portable to almost all SQL dialects, so it's better to rewrite the query to make this answer more or less helpful without code rewriting — astentx, Feb 05 '21 at 11:45

Efficient Multiple Group-bys

4 Answers4