0

I have categorical log data in bigquery database that I want to process based on sliding window. I want to apply MODE operation on a window of size 3 or 5, so that one-off events or category changes are discarded.

|SysDT | Power_State | Target |
| -------- | -------- | -------- |
|2021-07-01 09:03:57+00:00| EDC | EDC   |
|2021-07-01 09:08:57+00:00| EDC | EDC   |
|2021-07-01 09:13:57+00:00| DWN | EDC   |
|2021-07-01 09:18:57+00:00| EDC | EDC   |
|2021-07-01 09:23:58+00:00| EDC | EDC   |
|2021-07-01 09:28:59+00:00| DWN | EDC   |
|2021-07-01 09:33:59+00:00| EDC | EDC   |

I try to use the OVER operator that gives me the required sliding window but next I need a custom MODE operator. Any idea in modifying this query to avoid such MODE function or writing a custom MODE function in bigquery?

SELECT *, MODE(Power_State) 
    OVER(ORDER BY SysDT ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as Target
FROM Master_Data.2021_07
ORDER BY SysDT

Any help really appreciated. Thanks

Aamir M. Khan
  • 138
  • 1
  • 8

1 Answers1

1

Consider below approach

select sysdt, power_state,
  ( select approx_top_count(state, 1)[offset(0)].value
    from unnest(arr) state
  ) as target
from (
  select *, array_agg(power_state) over win arr
  from your_table
  window win as (order by sysdt rows between 1 preceding and 1 following)
)
-- order by sysdt      

if applied to sample data in your question

with your_table as (
  select '2021-07-01 09:03:57+00:00' sysdt, 'EDC' power_state union all 
  select '2021-07-01 09:08:57+00:00', 'EDC' union all 
  select '2021-07-01 09:13:57+00:00', 'DWN' union all 
  select '2021-07-01 09:18:57+00:00', 'EDC' union all 
  select '2021-07-01 09:23:58+00:00', 'EDC' union all 
  select '2021-07-01 09:28:59+00:00', 'DWN' union all 
  select '2021-07-01 09:33:59+00:00', 'EDC' 
)           

output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thanks Mikhail. It works like a charm. To better understand how it works can you please explain it a bit? The inner query returns a window array called 'arr'. What happens in the outer query? – Aamir M. Khan Dec 25 '21 at 23:12
  • sure. `approx_top_count` is applied to that `arr` and returns array of top elements along with respective counts. even though it has just one element here - you still need to extract it using `[offset(0)]` and then you get actual value of most frequent one – Mikhail Berlyant Dec 25 '21 at 23:15