2

I have the following table:

Year Week Day_1 Day_2 Day_3
2020 1 Walk Jump Swim
2020 3 Walk Swim Walk
2020 1 Jump Walk Swim

I want to group by YEAR, WEEK and Event (Walk, jump, Swim) and count the number of times each event occurs in Day_1, Day_2, Day_3. I.e.

Year Week Event Count_Day_1 Count_Day_2 Count_Day_3
2020 1 Walk 1 1 0
2020 3 Walk 1 0 1
2020 1 Jump 1 1 0
2020 3 Jump 0 0 0
2020 1 Swim 0 0 2
2020 3 Swim . 0 1 0

How can I do this efficiently?

Joe
  • 175
  • 3
  • 10
  • As a possible solution: [this answer on UNPIVOT](https://stackoverflow.com/a/27832362/2778710) accompanied with conditional aggregation may be helpful. – astentx Feb 05 '21 at 11:33

4 Answers4

4

In BigQuery, I would unpivot using arrays and then aggregate:

with t as (
       select 2020 as year, 1 as week, 'Walk' as day_1, 'Jump' as day_2, 'Swim' as day_3 union all
       select 2020, 3, 'Walk', 'Swim', 'Walk' union all
       select 2020, 1, 'Jump', 'Walk', 'Swim'
      )
select t.year, t.week, s.event,
       countif(day = 1) as day_1, countif(day = 2) as day_2, countif(day = 3) as day_3
from t cross join
     unnest([struct(t.day_1 as event, 1 as day),
             struct(t.day_2 as event, 2 as day),
             struct(t.day_3 as event, 3 as day)
            ]) s
group by t.year, t.week, s.event;
              
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Consider this less verbose option

select year, week, event, 
  countif(offset = 0) as day_1, 
  countif(offset = 1) as day_2, 
  countif(offset = 2) as day_3
from `project.dataset.table`,
unnest([day_1, day_2, day_3]) event with offset
where not event is null
group by year, week, event   

If applied to sample data in your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
0

You need to find distinct event, do cross join with your table and use conditional aggregation as follows:

select t.year, t.week, e.event, 
       count(case when t.day_1 = e.event then 1 end) as count_day_1,
       count(case when t.day_2 = e.event then 1 end) as count_day_2,
       count(case when t.day_3 = e.event then 1 end) as count_day_3 
  from your_Table t
  cross join (select distinct day_1 as event from your_table
              union all select day_2 from your_table
              union all select day_3 from your_table) e
group by t.year, t.week, e.event
astentx
  • 6,393
  • 2
  • 16
  • 25
Popeye
  • 35,427
  • 4
  • 10
  • 31
  • There's no `event` column in the source table, so there should be unpivoting logic inside `cross join` – astentx Feb 05 '21 at 11:34
  • Right @astentx. Updated the query. Please check – Popeye Feb 05 '21 at 11:36
  • 1
    I've edited, should work: 1) `e.day_N` should be `t.day_N`; 2) BigQuery [throws an error](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#union_example) for `UNION` and asks to do `select distinct ... union all`. I'ts interesting! – astentx Feb 05 '21 at 11:51
0

Demo code is MS SQL!

If you want to generate a full grid for every week and every year for every event then there are two pre-aggregates required, one for event and another one for every year and week.

Like:

DECLARE
  @OriginalData
  TABLE
  (
    numYear   smallint,
    numWeek   tinyint,
    dscDay1   nvarchar(20),
    dscDay2   nvarchar(20),
    dscDay3   nvarchar(20)
  )
;

INSERT INTO
  @OriginalData
(
  numYear, numWeek, dscDay1, dscDay2, dscDay3
)
VALUES
  ( 2020, 1, N'Walk', N'Jump', N'Swim' ),
  ( 2020, 3, N'Walk', N'Swim', N'Walk' ),
  ( 2020, 1, N'Jump', N'Walk', N'Swim' )
;

SELECT
  numYear, numWeek, dscDay1, dscDay2, dscDay3
FROM
  @OriginalData
;

WITH
  cteNormalise
(
  dscActivity
)
AS
(
  SELECT
    dscDay1
  FROM
    @OriginalData
  GROUP BY
    dscDay1
  UNION
  SELECT
    dscDay2
  FROM
    @OriginalData
  GROUP BY
    dscDay2
  UNION
  SELECT
    dscDay3
  FROM 
    @OriginalData
  GROUP BY
    dscDay3
),
  cteGrid
(
  numYear,
  numWeek
)
AS
(
  SELECT
    numYear,
    numWeek
  FROM
    @OriginalData
  GROUP BY
    numYear,
    numWeek
)
SELECT
  --/* Debug output */ *
  YearWeek.numYear,
  YearWeek.numWeek,
  Normalised.dscActivity,
  Count( Day1.dscDay1 ) AS CountDay1,
  Count( Day2.dscDay2 ) AS CountDay2,
  Count( Day3.dscDay3 ) AS CountDay3
FROM
  cteNormalise AS Normalised
  CROSS JOIN cteGrid AS YearWeek
  LEFT OUTER JOIN @OriginalData AS Day1
    ON  Day1.dscDay1 = Normalised.dscActivity
    AND Day1.numYear = YearWeek.numYear
    AND Day1.numWeek = YearWeek.numWeek
  LEFT OUTER JOIN @OriginalData AS Day2
    ON  Day2.dscDay2 = Normalised.dscActivity
    AND Day2.numYear = YearWeek.numYear
    AND Day2.numWeek = YearWeek.numWeek
  LEFT OUTER JOIN @OriginalData AS Day3
    ON  Day3.dscDay3 = Normalised.dscActivity
    AND Day3.numYear = YearWeek.numYear
    AND Day3.numWeek = YearWeek.numWeek
GROUP BY
  YearWeek.numYear,
  YearWeek.numWeek,
  Normalised.dscActivity
ORDER BY
  YearWeek.numYear,
  Normalised.dscActivity,
  YearWeek.numWeek
;

This will work, however efficiency is questionable due to the steps to normalise the data before the actual aggregation happens.

If possible I suggest converting the table first into a 3NF with just key columns of Year, Week, Event and Day. Then a fairly efficient summary can be produced. At the cost of the normalisation beforehand. Otherwise the cost of transformation is required in the query.

Knut Boehnert
  • 488
  • 5
  • 10
  • It is hard to port this T-SQL stuff with insert into table variable to other DBMSes. You can use `with` clause for this to make it portable to almost all SQL dialects, so it's better to rewrite the query to make this answer more or less helpful without code rewriting – astentx Feb 05 '21 at 11:45