I have a database table of user interactions. I would like to create groups of users based on the time & place of interaction. That is, if users are interacting at roughly the same time (e.g., 2 minute window) in the same location, I would consider them a group. The groups do not need to be mutually exclusive, but they do need to be exhaustive. Every user interaction belongs in one or more groups.
I've done something similar to this in the past with python and a disjoint set. But now I am limited to a SQL solution.
Assume a toy data table like
create table example_intxns (
intxn_date DATE,
loc_id number,
intxn_timestamp timestamp,
user_id varchar(100)
);
insert into example_intxns (intxn_date, loc_id, intxn_timestamp, user_id)
values
('2021-01-01', 1, '2021-01-01 08:00:00', 'a'),
('2021-01-01', 1, '2021-01-01 08:01:00', 'b'),
('2021-01-01', 1, '2021-01-01 08:02:00', 'c'),
('2021-01-01', 1, '2021-01-01 08:04:00', 'd'),
('2021-01-01', 1, '2021-01-01 08:05:00', 'e'),
('2021-01-01', 1, '2021-01-01 08:07:00', 'f'),
('2021-01-01', 1, '2021-01-01 08:10:00', 'g'),
('2021-01-01', 1, '2021-01-01 08:02:00', 'h')
;
I can create pairs of users who interact within 2 minutes of each other like so
select distinct
a.intxn_date,
a.loc_id,
a.user_id as seed_user_id,
b.user_id
from
example_intxns a
inner join
example_intxns b
on a.intxn_date = b.intxn_date
and a.loc_id = b.loc_id
and timestampdiff(minute, a.intxn_timestamp, b.intxn_timestamp) between -2 and 2
and b.user
This returns a result set of all user pairs based on the same location and an interaction window of +/- 2 minutes. A user will always pair with themself- that's useful for when there is no one else interacting at the same time & place. And a user can pair with another user. Here, I use the b.user_id >= a.user_id
condition because A->B is equivalent to B->A. I don't need both.
But here is where I'm stuck. I don't know how to extend pairs to groups in SQL-way. It feels like it could be a recursive problem?
I think what I want- and I'm still thinking through edge cases- is a result set that looks something like
Date | Location | GroupID | GroupMember |
---|---|---|---|
2021-01-01 | 1 | 1 | a |
2021-01-01 | 1 | 1 | b |
2021-01-01 | 1 | 1 | c |
2021-01-01 | 1 | 1 | h |
2021-01-01 | 1 | 2 | c |
2021-01-01 | 1 | 2 | h |
2021-01-01 | 1 | 2 | d |
2021-01-01 | 1 | 3 | d |
2021-01-01 | 1 | 3 | e |
2021-01-01 | 1 | 4 | e |
2021-01-01 | 1 | 4 | f |
2021-01-01 | 1 | 5 | g |
This effectively tells me that on this date, at this location, I have 5 groups: [a,b,c,h], [c,h,d], [d,e], [e,f] and [g]
Complications for a solution include the same user interacting multiple times on the same day in the same location. Not present in the toy example, but possible in real data.
I'm working in Snowflake, if that makes a difference. The real world problem includes 10s of thousands of users every day, at a half-dozen locations, and ~5 years interactions.
EDIT: I moved the goal posts a bit on the description of what I wanted. My desired output reflected the need to exclude group occurrences that were a subset of an earlier group. That is, [b,c] is a valid group, but it's covered by the [a,b,c,h] group. But I didn't specifically call this criteria out. My final solution, which covers this additional requirement, is as follows:
with pairs as ( -- qualified pairs
select distinct
a.intxn_date,
a.loc_id,
a.user_id as seed_user_id,
a.intxn_timestamp as seed_intxn_timestamp,
b.user_id,
b.intxn_timestamp as intxn_timestamp,
dense_rank() over (partition by a.intxn_date, a.loc_id order by seed_user_id) as group_id
from
example_entries a
inner join
example_entries b
on a.intxn_date = b.intxn_date
and a.loc_id = b.loc_id
and timestampdiff(second, a.intxn_timestamp, b.intxn_timestamp) between 0 and 120
and b.user_id>=a.user_id
),
groups as ( -- qualified pairs converted into groups
select p.*,
max(p.intxn_timestamp)over(partition by p.group_id) as max_group_intxn_timestamp
from pairs p
),
subtracts as ( -- groups already completely assumed in another, earlier gruop
select a.*
from groups a -- what we want to subtract
inner join groups b
on a.seed_user_id = b.user_id
and a.loc_id = b.loc_id
and a.intxn_date = b.intxn_date
and a.group_id > b.group_id
and timestampdiff(second, a.seed_intxn_timestamp, b.intxn_timestamp) between 0 and 120
inner join groups c
on b.loc_id = c.loc_id
and b.intxn_date = c.intxn_date
and b.seed_intxn_timestamp = c.seed_intxn_timestamp
and b.group_id = c.group_id
and c.user_id >= b.user_id
and b.seed_user_id = c.seed_user_id
and a.user_id = c.user_id
and a.max_group_intxn_timestamp <= c.max_group_intxn_timestamp
)
select distinct a.intxn_date, a.loc_id, a.group_id, a.user_id, a.seed_intxn_timestamp as group_intxn_window_start_timestamp, a.max_group_intxn_timestamp as group_intxn_window_end_timestamp
from groups a
left join (select distinct intxn_date, loc_id, seed_intxn_timestamp, group_id from subtracts) b
on a.intxn_date = b.intxn_date
and a.loc_id = b.loc_id
and a.seed_intxn_timestamp = b.seed_intxn_timestamp
and a.group_id = b.group_id
where b.group_id is null
/*minus -- could have used a MINUS, but I think the LEFT JOIN way is safer because it removes all traces of the unqualified group
select distinct intxn_date, loc_id, group_id, user_id, seed_intxn_timestamp
from subtracts a*/
order by 1,2,3,4