
Based on the BigQuery best practice of using ARRAY_AGG() to obtain the latest record, here's how I'm getting the first, last, minimum, and maximum values of a field for a day. The data is reported approximately hourly.

    WITH t AS (
    SELECT TIMESTAMP('2021-01-01 01:00:00') as l, 10 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-01 02:00:00') as l, 12 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-01 03:00:00') as l, 15 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-01 04:00:00') as l, 2 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-02 01:00:00') as l, 600 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-02 02:00:00') as l, 120 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-02 03:00:00') as l, 150 as v
    UNION ALL
    SELECT TIMESTAMP('2021-01-03 04:00:00') as l, 0 as v)

    SELECT EXTRACT(DATE FROM l) d,
      ARRAY_AGG(t.v ORDER BY t.l ASC LIMIT 1)[OFFSET(0)] first_value,
      ARRAY_AGG(t.v ORDER BY t.l DESC LIMIT 1)[OFFSET(0)] last_value,
      ARRAY_AGG(t.v ORDER BY t.v DESC LIMIT 1)[OFFSET(0)] max_value,
      ARRAY_AGG(t.v ORDER BY t.v ASC LIMIT 1)[OFFSET(0)] min_value
    FROM t
    GROUP BY d

Output:

    Row  d           max_value  min_value  last_value  first_value
    1    2021-01-01  15         2          2           10
    2    2021-01-02  600        120        150         600
    3    2021-01-03  0          0          0           0

Since there are only six BigQuery questions on Code Review, I thought I'd ask here on the main Stack Overflow. Is this the fastest method? Do I have anything extraneous in my query? (I'm not too sure that [OFFSET(0)] is doing anything.)

I've seen this question asked on Stack Overflow for Oracle, T-SQL and Postgres but I haven't seen anything specific for BigQuery. Thanks!


1 Answer


An obvious improvement is to use simple MIN() and MAX() for min_value and max_value:

select date(l) d,
  array_agg(v order by l asc limit 1)[offset(0)] first_value,
  array_agg(v order by l desc limit 1)[offset(0)] last_value,
  max(v) max_value,
  min(v) min_value
from t
    group by d

Other than that, using array_agg is a good practice here, and [offset(0)] is important: without it, each output would be an array with one element, whereas you most likely want the element itself, as in the sketch below.
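
For illustration, a minimal sketch against the same t from the question; the first_value_arr alias is just for demonstration:

    select date(l) d,
      array_agg(v order by l asc limit 1) first_value_arr,        -- ARRAY<INT64>, e.g. [10]
      array_agg(v order by l asc limit 1)[offset(0)] first_value  -- INT64, e.g. 10
    from t
    group by d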

One more option, depending on the volume of your data: you can try the approach below, which uses analytic (window) aggregate functions instead of plain aggregate functions.

    -- distinct collapses the identical per-row window results to one row per day
    select distinct * from (
      select date(l) d,
        first_value(v) over(partition by date(l) order by l asc) first_value,
        first_value(v) over(partition by date(l) order by l desc) last_value,
        max(v) over(partition by date(l)) max_value,
        min(v) over(partition by date(l)) min_value
      from t
    )

More options to consider: approximate aggregate functions, as in the example below. The weight passed to approx_top_sum favors the earliest row (1 / unix_seconds(l)) or the latest row (unix_seconds(l)), so the top-1 entry is the first or last value of the day.

    select extract(date from l) d,
      -- weight 1 / unix_seconds(l) is largest for the earliest timestamp
      approx_top_sum(v, 1 / unix_seconds(l), 1)[offset(0)].value first_value,
      -- weight unix_seconds(l) is largest for the latest timestamp
      approx_top_sum(v, unix_seconds(l), 1)[offset(0)].value last_value,
      max(v) max_value,
      min(v) min_value
    from t
    group by d
  • Thanks, that's a good spot on the max() and min(). Is there any difference between `EXTRACT(DATE FROM l)` and `DATE(l)` in terms of execution time? There is a difference in the execution details: extract_date() is the function called when EXTRACT is used, whereas date() is the function called when DATE() is used. – Martin Burch Jan 24 '22 at 22:46
  • `date(ts)` vs `extract(date from ts)`? I would use the former, as it is easier/faster to type and more descriptive. As for execution time / performance, I doubt there is much difference, and any difference is most likely not material - but I don't really know, as it depends on BQ engine implementation details! – Mikhail Berlyant Jan 24 '22 at 22:59
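
For reference, a minimal sketch (assuming the default UTC time zone) showing that the two expressions agree:

    select date(l) via_date,
      extract(date from l) via_extract,
      date(l) = extract(date from l) same
    from (select timestamp('2021-01-01 01:00:00') l)
    -- both columns return 2021-01-01 and same is true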