0

I have to rewrite a script written for Snowflake into Databricks and need some help on how to replicate CONDITIONAL_TRUE_EVENT as Databricks doesn't have that function.

The need is for me to group events together if they have the same user and device and took place within 300 seconds (5 minutes) of each other.

CREATE TABLE events
     (
        event_timestamp timestamp,
        user_id bigint,
        device_id bigint
     );

INSERT INTO events VALUES
  ('2022-07-12 05:00:00',1,1),
  ('2022-07-12 05:03:00',1,1),
  ('2022-07-12 05:04:00',1,2),
  ('2022-07-12 05:05:00',1,2),
  ('2022-07-12 05:06:00',2,1),
  ('2022-07-12 05:07:00',1,1),
  ('2022-07-12 05:15:00',1,1);

SELECT event_timestamp, user_id, device_id, group_id
FROM events

should return

'2022-07-12 05:00:00',1,1,1
'2022-07-12 05:03:00',1,1,1
'2022-07-12 05:04:00',1,2,2
'2022-07-12 05:05:00',1,2,2
'2022-07-12 05:06:00',2,1,3
'2022-07-12 05:07:00',1,1,1
'2022-07-12 05:15:00',1,1,4

The first 3 instances where user_id = 1, device_id = 1 are all group_id = 1 because the next event is within 5 minute of the previous event except for the last one because (group_id = 4) because at 05:15:00 it is more than 5 minutes away from the previous event with user_id = 1, device_id = 1 (05:07:00).

It seems to me that I should be able to find some combination of LAG, CASE, and SUM to calculate the group_id, but I just cannot figure it out.

halfer
  • 19,824
  • 17
  • 99
  • 186
  • somehow your result table doesn't correspond to your presented data – nbk Jul 12 '22 at 19:04
  • Could you post the original Snowflake query with CONDITIONAL_TRUE_EVENT that produces requested output? – Lukasz Szozda Jul 12 '22 at 19:12
  • @nbk, the group_id column is not the events table as you have noted. It is something I have to derive. – Andrew Park Jul 12 '22 at 19:43
  • @LukaszSzozda SELECT *, LAG(event_end_date_utc) OVER (PARTITION BY user_id,device_id ORDER BY event_start_date_utc) AS PREV_EVENT_END_DATE_TIME_UTC, COALESCE(TIMESTAMPDIFF('second',PREV_EVENT_END_DATE_TIME_UTC,event_start_date_utc),305) AS TIME_DIFF, CONDITIONAL_TRUE_EVENT(TIME_DIFF > 300) OVER (PARTITION BY user_id,device_id, ORDER BY event_start_date_utc) AS GROUP_COL – Andrew Park Jul 12 '22 at 19:46
  • @AndrewPark the sixth column is ` ('2022-07-12 05:07:00',2,1),` but in your rrsult it is different and group_id should be 3 – nbk Jul 12 '22 at 21:42
  • @nbk, you are correct. Sixth row should be 2022-07-12 05:07:00, 2, 1 based on the insert statement. So, as you pointed out, it should be grouped with the previous row into group_id = 3. I was trying to illustrate the fact that the sixth row should be grouped w/ group_id = 1. I have fixed it in my question. – Andrew Park Jul 12 '22 at 21:48
  • now it should be group_id 4 – nbk Jul 12 '22 at 21:50

1 Answers1

3

Edit: I had previously answered this for the CONDITIONAL_CHANGE_EVENT, which is a bit more challenging to express in ANSI SQL. This updated answer is for CONDITIONAL_TRUE_EVENT as the question asks.

It is simply a matter of conditional summing in the window function.

create or replace table T1(PK int, EVNT string);
    
insert into T1(PK, EVNT) values 
    (1, 'A'), (2, 'C'), (3, 'B'), (4, 'A'), (5, 'A'), 
    (6, 'C'), (7, 'C'), (8, 'A'), (9, 'D'), (10, 'A');
select 
    PK, 
    conditional_true_event(EVNT = 'A') over (partition by null order by PK) 
from T1; 
PK CONDITIONAL_TRUE_EVENT(EVNT = 'A') OVER (PARTITION BY NULL ORDER BY PK)
1 1
2 1
3 1
4 2
5 3
6 3
7 3
8 4
9 4
10 5
select 
   PK, 
   sum(iff(EVNT = 'A', 1, 0)) over (partition by null order by PK) as TRUE_EVENT
from T1;
PK TRUE_EVENT
1 1
2 1
3 1
4 2
5 3
6 3
7 3
8 4
9 4
10 5

So for your query you would replace iif(EVNT = 'A', 1, 0) in the conditional sum with iif(TIME_DIFF > 300, 1, 0)

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
Greg Pavlik
  • 10,089
  • 2
  • 12
  • 29
  • Thank you @Greg Pavlik. Trying it now! So, would I update the PARTITION BY NULL to accommodate the requirement to group by user_id & device_id? – Andrew Park Jul 12 '22 at 20:53
  • Thank you @GregPavlik. Trying it now! So, would I update the ```PARTITION BY NULL``` to accommodate the requirement to group by user_id & device_id? – Andrew Park Jul 12 '22 at 20:59
  • I used `partition by null` simply because I didn't put a partitioning key in the very simple table. You can partition by anything you'd normally partition a window function. There's nothing special about `partition by null` except it tells the window function that the partition is the whole table (or selected rows at least). – Greg Pavlik Jul 12 '22 at 21:04