BigQuery recursively join based on links between 2 ID columns

Question

Given a table representing a many-many join between IDs like the following:

WITH t AS (
  SELECT 1 AS id_1, 'a' AS id_2,
  UNION ALL SELECT 2, 'a'
  UNION ALL SELECT 2, 'b'
  UNION ALL SELECT 3, 'b'
  UNION ALL SELECT 4, 'c'
  UNION ALL SELECT 5, 'c'
  UNION ALL SELECT 6, 'd'
  UNION ALL SELECT 6, 'e'
  UNION ALL SELECT 7, 'f'
)

SELECT * FROM t

id_1	id_2
1	a
2	a
2	b
3	b
4	c
5	c
6	d
6	e
7	f

I would like to be able recursively join then aggregate rows in order to find each disconnected sub-graph represented by these links - that is each collection of IDs that are linked together:

The desired output for the example above would look something like this:

id_1_coll	id_2_coll
1, 2, 3	a, b
4, 5	c
6	d, e
7	f

where each row contains all the other IDs one could reach following the links in the table.

Note that 1 links to b even although there is no explicit link row because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.

You are looking for a " disjoint-set union (DSU)". For a small amount of rows, this can be solved. https://stackoverflow.com/questions/74157658/create-a-group-of-linked-items — Samuel, Dec 16 '22 at 15:57

Stewart_R · Answer 1 · 2023-01-12T08:05:50.270

One potential approach is to remodel the relationships between id_1 and id_2 such that we get all the links from id_1 to itself then use a recursive common table expression to traverse all the possible paths between id_1 values then aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.

Explanation

Our steps are

Remodel the relationship into a series of self-joins for id_1
Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function

We can use something like this to remodel the relationships into a self join (1.):


SELECT 
  a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
  INNER JOIN t as b 
    ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1

Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:

WITH RECURSIVE base_iter AS (
  SELECT 
    a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
  FROM t as a
    INNER JOIN t as b 
    ON a.id_2 = b.id_2
  WHERE a.id_1 != b.id_1
)

We can also grab the lowest id_1 value at this time:

id_1	linked_id	lowest_linked_id
1	2	1
2	1	1
2	3	2
3	2	2
4	5	4
5	4	4

For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.

We can code this as follows:


recursive_loop AS (
  SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
  FROM base_iter
  UNION ALL
    SELECT 
      prev_iter.id_1,  prev_iter.linked_id,
      iter.lowest_linked_id,
      ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
    FROM base_iter AS prev_iter
    JOIN recursive_loop AS iter
      ON iter.id_1 = prev_iter.linked_id
      AND iter.lowest_linked_id <  prev_iter.lowest_linked_id
      AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )      
)

Giving us the following results: |id_1|linked_id|lowest_linked_id|linked_ids| |----|---------|------------|---| |3|2|1|[1,2]| |2|3|1|[1,2,3]| |4|5|4|[5]| |1|2|1|[2]| |5|4|4|[4]| |2|3|2|[3]| |2|1|1|[1]| |3|2|2|[2]|

which we can now link back to the original table for the id_2 values then aggregate (3.) as shown in the complete query below

Solution

WITH RECURSIVE t AS (
  SELECT 1 AS id_1, 'a' AS id_2,
  UNION ALL SELECT 2, 'a'
  UNION ALL SELECT 2, 'b'
  UNION ALL SELECT 3, 'b'
  UNION ALL SELECT 4, 'c'
  UNION ALL SELECT 5, 'c'
  UNION ALL SELECT 6, 'd'
  UNION ALL SELECT 6, 'e'
  UNION ALL SELECT 7, 'f'
),

base_iter AS (
  SELECT 
    a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
  FROM t as a
    INNER JOIN t as b 
    ON a.id_2 = b.id_2
  WHERE a.id_1 != b.id_1
),

recursive_loop AS (
  SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
  FROM base_iter
  UNION ALL
    SELECT 
      prev_iter.id_1,  prev_iter.linked_id,
      iter.lowest_linked_id,
      ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
    FROM base_iter AS prev_iter
    JOIN recursive_loop AS iter
      ON iter.id_1 = prev_iter.linked_id
      AND iter.lowest_linked_id <  prev_iter.lowest_linked_id
      AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )

      
),

link_back AS (
  SELECT 
    t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
  FROM t
    LEFT JOIN recursive_loop
    ON t.id_1 = recursive_loop.id_1
),

by_id_1 AS (
  SELECT 
    id_1,
    MIN(lowest_linked_id) AS grp

  FROM link_back
    GROUP BY 1
),

by_id_2 AS (
  SELECT 
    id_2,
    MIN(lowest_linked_id) AS grp

  FROM link_back
    GROUP BY 1
),

result AS (
  SELECT 
    by_id_1.grp,
    ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
    ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
  FROM 
    by_id_1
    INNER JOIN by_id_2
    ON by_id_1.grp = by_id_2.grp
  GROUP BY grp
)

SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll  
FROM result ORDER BY grp

Giving us the required output:

grp	id1_coll	id2_coll
1	[1,2,3]	[a,b]
4	[4,5]	[c]
6	[6]	[d,e]
7	[7]	[f]

Limitations/Issues

Unfortunately this approach is inneficient (we have to traverse every single pathway before aggregating it back together) and fails with the real-world case where we have several million join rows. When trying to execute on this data BigQuery runs up a huge "Slot time consumed" then eventually errors out with:

Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.

I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id in already in the list of linked_ids we dont need to check it further).

My revision was rejected...was pretty sure you meant "FROM result" in your final query — JJ32, Jan 11 '23 at 16:45

etsuhisa · Answer 2 · 2022-12-29T00:38:14.723

Using ROW_NUMBER() the query is as the follow:

WITH RECURSIVE
t AS (
  SELECT 1 AS id_1, 'a' AS id_2,
  UNION ALL SELECT 2, 'a'
  UNION ALL SELECT 2, 'b'
  UNION ALL SELECT 3, 'b'
  UNION ALL SELECT 4, 'c'
  UNION ALL SELECT 5, 'c'
  UNION ALL SELECT 6, 'd'
  UNION ALL SELECT 6, 'e'
  UNION ALL SELECT 7, 'f'
),
t1 AS (
  SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
  SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
    WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
  UNION ALL
  SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
    CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
         THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
         ELSE t2.arr_1 END,
    CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
         THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
         ELSE t2.arr_2 END,
    t1.id_1, t1.id_2
    FROM t2 JOIN t1 ON
      t2.n < t1.n AND
      t1.n NOT IN UNNEST(t2.n_arr) AND
      (t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
      (t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
  SELECT
    n,
    ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
    ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
  FROM t2
  WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
  GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n

t1 : Append with row numbers.
t2 : Extract rows matching either id_1 or id_2 by recursive query.
t3 : Make arrays from id_1 and id_2 with ARRAY_AGG().

However, it may not help your Limitations/Issues.

JJ32 · Answer 3 · 2023-01-03T17:38:51.583

The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order/one or both id's move to the next value):

SELECT GrpNr,
  STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
  STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
  SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END) 
    OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
  ROW_NUMBER() OVER () as RowNr,
  LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
  LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr

I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).

This can also be done using sequential updates as done by @RomanPekar here (which I also tested). The main consideration seems to be performance. I'd assume dbms have gotten better at recursion since this was posted.

Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.

I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.

BigQuery recursively join based on links between 2 ID columns

3 Answers3

Explanation

Solution

Limitations/Issues