1

I have a table like below, containing a group_id and some value.

group_id | value
---------+-------
1        | A
1        | B
2        | C
2        | D
2        | A
3        | E
3        | C
4        | G
4        | H

What I want to get is a unique number for each group that is somehow connected.. like this:

Group 1 and 2 have a common element A, Group 1 and 3 have a common element C > so this is actually one big group.

master_id | group_id | value
----------+----------+--------
1         | 1        | A
1         | 1        | B
1         | 2        | C
1         | 2        | D
1         | 2        | A
1         | 3        | E
1         | 3        | C
2         | 4        | G
2         | 4        | H

How can I get this master_id?

GMB
  • 216,147
  • 25
  • 84
  • 135
Dextrious
  • 23
  • 4
  • Please tag your question with the database that you are using: mysql, oracle, postgresql...? – GMB Mar 01 '20 at 20:57

2 Answers2

4

Calculating the master group is a graph-walking problem, which implies a recursive CTE. I would approach this by:

  • Generating edges between the groups, based on the values.
  • Traversing the edges without visiting previous groups.

The calculation of the master group is then the minimum of the visited groups for each group.

In SQL, this looks like:

with edges as (
      select distinct t1.group_id as group_id_1, t2.group_id as group_id_2
      from t t1  join
           t t2
           on t1.value = t2.value
     ),
     cte as (
      select e.group_id_1, e.group_id_2, convert(varchar(max), concat(',', group_id_1, ',', group_id_2)) as visited, 1 as lev
      from edges e
      union all
      select cte.group_id_1, e.group_id_2, 
             concat(visited, e.group_id_2, ','), lev + 1
      from cte join
           edges e
           on e.group_id_1 = cte.group_id_2
      where cte.visited not like concat('%,', e.group_id_2, ',%') and lev < 5
     )
select group_id_1, dense_rank() over (order by min(group_id_2)) as master_group
from cte
group by group_id_1;

Here is a db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Sql server example:

WITH GetConnected AS (
  SELECT DISTINCT g1.group_id sourceGroup, g2.group_id connectedGroup
  FROM @groups g1
  LEFT JOIN @groups g2
    ON g1.value = g2.value
  UNION ALL
  SELECT g1.group_id sourceGroup, g3.connectedGroup connectedGroup
  FROM @groups g1
  INNER JOIN @groups g2
    ON g1.value = g2.value
    AND g1.group_id < g2.group_id
  INNER JOIN GetConnected g3
    ON g3.sourceGroup = g2.group_id
    AND g3.connectedGroup > g2.group_id
), GetGroups AS (
  SELECT MIN(sourceGroup) sourceGroup, connectedGroup, DENSE_RANK() OVER (ORDER BY MIN(sourceGroup)) rk
  FROM GetConnected
  GROUP BY connectedGroup)

SELECT gg.rk master_id, g.group_id, g.value
FROM GetGroups gg
INNER JOIN @groups g
  ON gg.connectedGroup = g.group_id
ORDER BY gg.rk, gg.connectedGroup, g.value

If you consider postgre, I have example code:

WITH RECURSIVE GetConnected AS (
  SELECT DISTINCT g1.group_id sourceGroup, g2.group_id connectedGroup
  FROM groups g1
  LEFT JOIN groups g2
    ON g1.value = g2.value
  UNION
  SELECT g1.group_id sourceGroup, g3.connectedGroup connectedGroup
  FROM groups g1
  LEFT JOIN groups g2
    ON g1.value = g2.value
  INNER JOIN GetConnected g3
    ON g3.sourceGroup = g2.group_id
), GetGroups AS (
  SELECT MIN(sourceGroup) sourceGroup, connectedGroup, DENSE_RANK() OVER (ORDER BY MIN(sourceGroup)) rk
  FROM GetConnected
  GROUP BY connectedGroup)

SELECT gg.rk master_id, g.group_id, g.value
FROM GetGroups gg
INNER JOIN groups g
  ON gg.connectedGroup = g.group_id
ORDER BY gg.rk, gg.connectedGroup, g.value
Leszek Mazur
  • 2,443
  • 1
  • 14
  • 28