1

Looking for some query structuring help. I have a table with rows for link timestamp, user_id, linked_id, type_if_link. These link types are for example 'email' vs. 'phone number' so in the example below you can see user 1 is not directly connected to user 3 but is via user 2. The other complication is that each 'linked account' appears in r1 as well, meaning there are several 'duplicate' fields (in the example: row 1+2 , row 3+4)

ex:

Link time          user id   linked_id   link type
---------------------------------------------------
link_occurred at   user 1    user 2      link a 
link_occurred at   user 2    user 1      link a
link_occurred at   user 2    user 3      link b
link_occurred at   user 3    user 2      link b 
link_occurred_at   user 4    user 5      link a
link_occurred_at   user 5    user 4      link a

What functions could I use to get the first user-id, a count of all the (directly+indirectly) linked accounts and possibly an array of the linked account ids.

For example the output I would want here is:

initial user - Count linked accounts  array of linked accounts 
--------------------------------------------------------------
user 1         2 linked               [user 2, user 3]
user 4         1 linked account       [user 5]

This would give me mutually exclusive grouping of all linked networks of accounts.

  • *Can* be done with a recursive CTE. But this class of problems is not well suited for SQL. We don't know where to start, so we have to keep track of all rows that are already part of a network. Does not scale well with big tables ... – Erwin Brandstetter Jun 12 '20 at 22:43

1 Answers1

0

I didn't know about recursive CTEs until Erwin Brandstetter mentioned them in the comment above. The concept is what it sounds like: a CTE that refers to itself, and has a base case so that recursion terminates. For your problem, a recursive CTE solution might look something like:

WITH accumulate_users AS (
  -- Base case: the direct links from a user_id.
  SELECT
    user_id AS user_id, 
    ARRAY_AGG(linked_id) AS linked_accounts
  FROM your_table
  GROUP BY user_id

  UNION ALL

  -- Recursive case: transitively linked accounts.
  SELECT
    ARRAY_UNION(
      accumulate_users.linked_accounts,
      ARRAY_AGG(DISTINCT your_table.linked_id)
    ) AS linked_accounts
  FROM accumulate_users
  JOIN your_table ON CONTAINS(accumulate_users.linked_accounts, your_table.user_id)
  GROUP BY accumulate_users.user_id

  -- But there is no enforced termination condition, hopefully it just
  -- ends at some point? This is part of why implementing recursive CTEs
  -- is challenging, I think.
)
SELECT
  user_id,
  CARDINALITY(linked_accounts) AS count_linked_accounts,
  linked_accounts
FROM accumulate_users

But, I haven't been able to test this query, because as detailed in another Stack Overflow Q&A Presto does not support recursive CTEs.

It is possible to traverse an arbitrary, but finite, number of links by repeatedly joining back to the table you have. Something like this, and I've included the second_, third_, fourth_degree_links only for clarity:

SELECT
  yt1.user_id,
  ARRAY_AGG(DISTINCT yt2.user_id) AS first_degree_links,
  ARRAY_AGG(DISTINCT yt3.user_id) AS second_degree_links,
  ARRAY_AGG(DISTINCT yt3.linked_user) AS fourth_degree_links,
  ARRAY_UNION(
    ARRAY_AGG(DISTINCT yt2.user_id), 
    ARRAY_UNION(ARRAY_AGG(DISTINCT yt3.user_id), ARRAY_AGG(DISTINCT yt3.linked_user))
  ) AS up_to_fourth_degree_links
FROM your_table AS yt1
JOIN your_table AS yt2 ON yt1.linked_user = yt2.user_id
JOIN your_Table AS yt3 ON yt2.linked_user = yt3.user_id
GROUP BY yt1.user_id

I've been working with a similar set of data, although I have the original identifiers as part of the raw data set. In other words the 'email' and 'phone number' in your example. I found it helpful to create a table that groups user ids by these connecting identifiers:

CREATE TABLE email_connections AS
SELECT
  email,
  ARRAY_AGG(DISTINCT user_id) AS users
FROM source_table
GROUP BY email

The same arbitrary-but-finite-depth set of links can then be computed by looking for intersections between the user arrays:

SELECT
    3764350 AS user_id,
    FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users))))) AS all_users,
    CARDINALITY(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users)))))) AS count_all_users
FROM email_connections AS emails1
JOIN email_connections AS emails2 ON CARDINALITY(ARRAY_INTERSECT(emails1.users, emails2.users)) > 0
JOIN email_connections AS emails3 ON CARDINALITY(ARRAY_INTERSECT(emails2.users, emails3.users)) > 0
JOIN email_connections AS emails4 ON CARDINALITY(ARRAY_INTERSECT(emails3.users, emails4.users)) > 0
WHERE CONTAINS(emails1.users, 3764350)
GROUP BY 1

Calculating links to an arbitrary depth is a good use case for a graph database technology like Neo4j or JanusGraph. That's what I'm now looking at to address this "user linking" problem.

Dave Cameron
  • 2,110
  • 2
  • 19
  • 23