
I have a database table of user interactions. I would like to create groups of users based on the time & place of interaction. That is, if users are interacting at roughly the same time (e.g., 2 minute window) in the same location, I would consider them a group. The groups do not need to be mutually exclusive, but they do need to be exhaustive. Every user interaction belongs in one or more groups.

I've done something similar to this in the past with python and a disjoint set. But now I am limited to a SQL solution.

Assume a toy data table like

create table example_intxns (
  intxn_date DATE, 
  loc_id number,
  intxn_timestamp timestamp,
  user_id varchar(100)
);

insert into example_intxns (intxn_date, loc_id, intxn_timestamp, user_id)
values
('2021-01-01', 1, '2021-01-01 08:00:00', 'a'),
('2021-01-01', 1, '2021-01-01 08:01:00', 'b'),
('2021-01-01', 1, '2021-01-01 08:02:00', 'c'),
('2021-01-01', 1, '2021-01-01 08:04:00', 'd'),
('2021-01-01', 1, '2021-01-01 08:05:00', 'e'),
('2021-01-01', 1, '2021-01-01 08:07:00', 'f'),
('2021-01-01', 1, '2021-01-01 08:10:00', 'g'),
('2021-01-01', 1, '2021-01-01 08:02:00', 'h')
;

I can create pairs of users who interact within 2 minutes of each other like so

select distinct 
    a.intxn_date, 
    a.loc_id,    
    a.user_id as seed_user_id, 
    b.user_id

from 
    example_intxns a
inner join 
    example_intxns b
        on a.intxn_date = b.intxn_date
        and a.loc_id = b.loc_id
        and timestampdiff(minute, a.intxn_timestamp, b.intxn_timestamp) between -2 and 2
        and b.user_id >= a.user_id

This returns a result set of all user pairs based on the same location and an interaction window of +/- 2 minutes. A user will always pair with themselves, which is useful when no one else is interacting at the same time and place. And a user can pair with another user. Here, I use the b.user_id >= a.user_id condition because A->B is equivalent to B->A; I don't need both.
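
On the toy data, that pair query should return something like this (18 rows, self-pairs included):

intxn_date  loc_id  seed_user_id  user_id
2021-01-01  1       a             a
2021-01-01  1       a             b
2021-01-01  1       a             c
2021-01-01  1       a             h
2021-01-01  1       b             b
2021-01-01  1       b             c
2021-01-01  1       b             h
2021-01-01  1       c             c
2021-01-01  1       c             d
2021-01-01  1       c             h
2021-01-01  1       d             d
2021-01-01  1       d             e
2021-01-01  1       d             h
2021-01-01  1       e             e
2021-01-01  1       e             f
2021-01-01  1       f             f
2021-01-01  1       g             g
2021-01-01  1       h             h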

But here is where I'm stuck. I don't know how to extend pairs into groups in a SQL way. It feels like it could be a recursive problem?
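
(Aside: the python disjoint-set approach amounts to connected components, and SQL can approximate that without recursion using a gaps-and-islands pattern: start a new component whenever the gap to the previous interaction at the same date/location exceeds 120 seconds. A minimal sketch follows; note it chains overlapping windows into one big component, e.g. [a,b,c,h,d,e,f] plus [g] here, which is not the overlapping-group output I describe below.)

with flagged as (
    select
        intxn_date, loc_id, user_id, intxn_timestamp,
        -- 1 starts a new component; a null lag (first row per date/loc) also starts one
        case when timestampdiff(second,
                 lag(intxn_timestamp) over (partition by intxn_date, loc_id
                                            order by intxn_timestamp),
                 intxn_timestamp) <= 120
             then 0 else 1 end as is_new_component
    from example_intxns
)
select
    intxn_date, loc_id, user_id, intxn_timestamp,
    -- running sum of the flags labels each connected component
    sum(is_new_component) over (partition by intxn_date, loc_id
                                order by intxn_timestamp
                                rows unbounded preceding) as component_id
from flagged;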

I think what I want (I'm still thinking through edge cases) is a result set that looks something like

Date        Location  GroupID  GroupMember
2021-01-01  1         1        a
2021-01-01  1         1        b
2021-01-01  1         1        c
2021-01-01  1         1        h
2021-01-01  1         2        c
2021-01-01  1         2        h
2021-01-01  1         2        d
2021-01-01  1         3        d
2021-01-01  1         3        e
2021-01-01  1         4        e
2021-01-01  1         4        f
2021-01-01  1         5        g

This effectively tells me that on this date, at this location, I have 5 groups: [a,b,c,h], [c,h,d], [d,e], [e,f] and [g]

One complication for a solution is the same user interacting multiple times on the same day in the same location. That's not present in the toy example, but it is possible in real data.
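
For example (hypothetical rows, not part of the toy data above), the same user could show up twice more at the same location:

insert into example_intxns (intxn_date, loc_id, intxn_timestamp, user_id)
values
('2021-01-01', 1, '2021-01-01 08:00:30', 'a'), -- a again, 30 seconds after their first interaction
('2021-01-01', 1, '2021-01-01 08:09:00', 'a')  -- a a third time, within 2 minutes of g
;

With those rows, user a would legitimately belong both to an early-morning group and to g's 08:10 group.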

I'm working in Snowflake, if that makes a difference. The real-world problem includes tens of thousands of users every day, at a half-dozen locations, across ~5 years of interactions.

EDIT: I moved the goal posts a bit in the description of what I wanted. My desired output reflected the need to exclude group occurrences that are a subset of an earlier group. That is, [b,c] is a valid group, but it's covered by the [a,b,c,h] group. I didn't specifically call that criterion out. My final solution, which covers this additional requirement, is as follows:

with pairs as ( -- qualified pairs
select distinct 
    a.intxn_date, 
    a.loc_id,    
    a.user_id as seed_user_id,
    a.intxn_timestamp as seed_intxn_timestamp,
    b.user_id,
    b.intxn_timestamp as intxn_timestamp,
    dense_rank() over (partition by a.intxn_date, a.loc_id order by seed_user_id) as group_id

from 
    example_intxns a
inner join 
    example_intxns b
        on a.intxn_date = b.intxn_date
        and a.loc_id = b.loc_id
        and timestampdiff(second, a.intxn_timestamp, b.intxn_timestamp) between 0 and 120
        and b.user_id>=a.user_id
),
groups as ( -- qualified pairs converted into groups
select p.*,
    -- partition by date/loc too, since group_id restarts per date/location
    max(p.intxn_timestamp) over (partition by p.intxn_date, p.loc_id, p.group_id) as max_group_intxn_timestamp
from pairs p
),
subtracts as ( -- groups already completely subsumed in another, earlier group
select a.*
from groups a -- what we want to subtract
inner join groups b
  on a.seed_user_id = b.user_id
  and a.loc_id = b.loc_id
  and a.intxn_date = b.intxn_date
  and a.group_id > b.group_id
  and timestampdiff(second, a.seed_intxn_timestamp, b.intxn_timestamp) between 0 and 120
inner join groups c
  on b.loc_id = c.loc_id
  and b.intxn_date = c.intxn_date
  and b.seed_intxn_timestamp = c.seed_intxn_timestamp
  and b.group_id = c.group_id
  and c.user_id >= b.user_id
  and b.seed_user_id = c.seed_user_id
  and a.user_id = c.user_id
  and a.max_group_intxn_timestamp <= c.max_group_intxn_timestamp
)
select distinct
    a.intxn_date,
    a.loc_id,
    a.group_id,
    a.user_id,
    a.seed_intxn_timestamp as group_intxn_window_start_timestamp,
    a.max_group_intxn_timestamp as group_intxn_window_end_timestamp
from groups a
left join (select distinct intxn_date, loc_id, seed_intxn_timestamp, group_id from subtracts) b
on a.intxn_date = b.intxn_date
and a.loc_id = b.loc_id
and a.seed_intxn_timestamp = b.seed_intxn_timestamp
and a.group_id = b.group_id
where b.group_id is null
/*minus -- could have used a MINUS, but I think the LEFT JOIN way is safer because it removes all traces of the unqualified group
select distinct intxn_date, loc_id, group_id, user_id, seed_intxn_timestamp
from subtracts a*/
order by 1,2,3,4
Amw 5G
  • a thing to note: `timestampdiff` & `datediff` are the same, and `datediff` gives you the difference in the unit you ask for; thus `2021-01-07 12:00:59` and `2021-01-07 12:02:00` are 1 min 1 sec apart, but `2 min` in datediff minute space. The safest method is to extract the grain you care about (aka seconds or milliseconds), subtract those two epoch times, and then compare against `2*60` if seconds, for example (see the snippet after these comments). – Simeon Pilgrim Jan 07 '22 at 09:50
  • It's a fair point, @SimeonPilgrim. I was so focused on the grouping logic, I didn't think through this finer detail. Thank you for bringing it to my attention! – Amw 5G Jan 07 '22 at 17:59
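
To illustrate Simeon's point, a quick sanity check you can run as-is in Snowflake:

select
    datediff(minute, '2021-01-07 12:00:59'::timestamp,
                     '2021-01-07 12:02:00'::timestamp) as diff_minutes, -- returns 2: two minute boundaries crossed
    datediff(second, '2021-01-07 12:00:59'::timestamp,
                     '2021-01-07 12:02:00'::timestamp) as diff_seconds  -- returns 61: compare this against 2*60 instead
;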

3 Answers


I modified your query a little, and then got group ids using RANK():

select *, rank() over (partition by intxn_date order by seed_timestamp) grp_id
from (
    -- every interaction acts as a seed; pull in everyone within the
    -- following 2 minutes at the same date and location
    select a.intxn_date, a.loc_id, a.intxn_timestamp as seed_timestamp, b.user_id
    from  example_intxns a
    inner join example_intxns b
    on a.intxn_date = b.intxn_date
    and a.loc_id = b.loc_id
    and timestampdiff(minute, a.intxn_timestamp, b.intxn_timestamp) between 0 and 2
    -- when timestamps tie, keep only one ordering to avoid mirrored duplicates
    and iff(a.intxn_timestamp = b.intxn_timestamp, a.user_id >= b.user_id, true)
)

[screenshot: query results, each interaction row tagged with its grp_id]

The group ids are non-consecutive, but you can see that there are 7 different groups, with each group having at least one element. Some users belong to many groups, but no interaction is left without a group.

Felipe Hoffa
  • It's much closer than I got on my own :). The shortcoming I see in the proposed approach is that every user interaction is the nucleus of another group. For example, Group 1 is generated from a's interaction; Group 5 is b's interaction, and so on. Even though b is temporally linked with c and h in Group 5, those pairs are already covered by Group 1. So Group 5 shouldn't exist. I failed to explain this in my Q and made you infer it from the output, my bad :/. I think I need to start with your suggestion as a CTE, and then figure out the appropriate MINUS to subtract out the redundancies. – Amw 5G Jan 07 '22 at 17:35
  • If this solves the problem in the question, please accept the answer. Happy to dig further on a new question ;) – Felipe Hoffa Jan 07 '22 at 18:29

You're close. Just add this to your existing select statement to assign group ids to user interactions. It results in 8 groups, but that's because it allows groups to be mutually inclusive, such that users can cross with other users multiple times a day (permitting a +/- 2 minute diff). It also allows users to cross with themselves, irrespective of whether they also crossed with a different user.

Since you're still thinking about the edge cases, I think this is a good/flexible start and should get you past the grouping hurdle. Make sure your data is deduped before the join.

dense_rank() over (partition by a.intxn_date, a.loc_id order by a.user_id) as group_id
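
In context, dropped into the pair query from the question, that might look like this (a sketch; b.user_id >= a.user_id is the dedup condition the question already uses):

select distinct
    a.intxn_date,
    a.loc_id,
    a.user_id as seed_user_id,
    b.user_id,
    -- each distinct seed user per date/location gets its own group id
    dense_rank() over (partition by a.intxn_date, a.loc_id order by a.user_id) as group_id
from example_intxns a
inner join example_intxns b
    on a.intxn_date = b.intxn_date
    and a.loc_id = b.loc_id
    and timestampdiff(minute, a.intxn_timestamp, b.intxn_timestamp) between -2 and 2
    and b.user_id >= a.user_id
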
Radagast
  • Ah, using a ranking function; I don't know why that didn't occur to me for setting the group ID. I was going to do some weird hashing thing. I appreciate the advice! – Amw 5G Jan 07 '22 at 17:58

Another approach to this might be as follows:

Firstly get all possible location|time combinations (aka groupys).

Simply left outer join. Done!

I also noticed that in the 'correct' answer the # of users at 8:02 should be 3, not the 5 reported.

[screenshot: query results, one row per window member]

Code|copy|paste|run:

select
  groupys.groups
, groupys.loc_id
, example_intxns.user_id
, groupys.starts
from
  ( -- one 2-minute window per distinct location/timestamp combination
    select
      row_number() over (order by intxn_timestamp) groups
    , loc_id
    , intxn_timestamp starts
    , intxn_timestamp + interval '2 minutes' ends
    from
      example_intxns
    group by
      2,3,4 ) groupys
left outer join
  example_intxns
  -- pull in every user whose interaction falls inside the window
  on example_intxns.loc_id = groupys.loc_id
  and example_intxns.intxn_timestamp between groupys.starts and groupys.ends
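
On the toy data this should produce seven windows; note the window anchored at 08:02 holds exactly the three users c, h, d (the "3 not 5" mentioned above). A sketch of the expected result:

groups  loc_id  user_id  starts
1       1       a        2021-01-01 08:00:00
1       1       b        2021-01-01 08:00:00
1       1       c        2021-01-01 08:00:00
1       1       h        2021-01-01 08:00:00
2       1       b        2021-01-01 08:01:00
2       1       c        2021-01-01 08:01:00
2       1       h        2021-01-01 08:01:00
3       1       c        2021-01-01 08:02:00
3       1       h        2021-01-01 08:02:00
3       1       d        2021-01-01 08:02:00
4       1       d        2021-01-01 08:04:00
4       1       e        2021-01-01 08:04:00
5       1       e        2021-01-01 08:05:00
5       1       f        2021-01-01 08:05:00
6       1       f        2021-01-01 08:07:00
7       1       g        2021-01-01 08:10:00
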
Adrian White
  • That's a new way of looking at the problem. And quite clever! Create all the ranges, then see which users fall into those intervals based on interaction time. Thank you for the suggestion! – Amw 5G Jan 10 '22 at 19:37
  • Yes I think it's a simpler way to phrase the problem and makes the maintenance somewhat easier. It also removes the duplicates seen in your official answer at 8:02am (where c and d are incorrectly repeated). – Adrian White Jan 10 '22 at 20:10