I have to rewrite a script written for Snowflake into Databricks and need some help on how to replicate CONDITIONAL_TRUE_EVENT as Databricks doesn't have that function.
The need is for me to group events together if they have the same user and device and took place within 300 seconds (5 minutes) of each other.
CREATE TABLE events
(
event_timestamp timestamp,
user_id bigint,
device_id bigint
);
INSERT INTO events VALUES
('2022-07-12 05:00:00',1,1),
('2022-07-12 05:03:00',1,1),
('2022-07-12 05:04:00',1,2),
('2022-07-12 05:05:00',1,2),
('2022-07-12 05:06:00',2,1),
('2022-07-12 05:07:00',1,1),
('2022-07-12 05:15:00',1,1);
SELECT event_timestamp, user_id, device_id, group_id
FROM events
should return
'2022-07-12 05:00:00',1,1,1
'2022-07-12 05:03:00',1,1,1
'2022-07-12 05:04:00',1,2,2
'2022-07-12 05:05:00',1,2,2
'2022-07-12 05:06:00',2,1,3
'2022-07-12 05:07:00',1,1,1
'2022-07-12 05:15:00',1,1,4
The first 3 instances where user_id = 1, device_id = 1 are all group_id = 1 because the next event is within 5 minute of the previous event except for the last one because (group_id = 4) because at 05:15:00 it is more than 5 minutes away from the previous event with user_id = 1, device_id = 1 (05:07:00).
It seems to me that I should be able to find some combination of LAG, CASE, and SUM to calculate the group_id, but I just cannot figure it out.