How to identify groups/clusters in set of arcs/edges in SQL?

Question

I have arcs/edges like these ‘haves’:

Node1   Node2
A       B
B       C
D       E

Here A is connected to B and B to C. D is connected to E. In other words there are 2 groups/clusters shown in these ‘wants’:

Node1   Node2   Cluster
A       B       1
B       C       1
D       E       2

Could I use SQL to identify these groups/clusters? I guess this involves self-joins but I cannot see how to write this SQL. Any feedback would be very much appreciated. Thanks!

The key is recursive CTEs. The answer depends highly on whether or not you have cycles. — Gordon Linoff, Jul 20 '17 at 11:03
Have used recursive CTEs in the past in SQL Server. Thanks for reminding me about this possibility. Yes I will have cycles ... not sure what to do. BTW can you use recursive CTEs in TeraData? — cs0815, Jul 20 '17 at 11:17
Recursion in Teradata uses the same Standard SQL syntax as SQL Server (you just have to write `WITH RECURSIVE ...`. Is this similar to this question https://stackoverflow.com/q/40204509/2527905? — dnoeth, Jul 20 '17 at 11:25
Ok thanks. I have notice that I can only have one CTE in TeraData. I used to break down my queries in several CTEs in SQL. Thanks. — cs0815, Jul 20 '17 at 11:30
No, you can have multiple (since TD14), but they skrewed it up (TD16 finally fixed this). You have to list nested CTEs in opposite order. — dnoeth, Jul 20 '17 at 11:43
Whoa, they finally fixed the backwards CTE thing? I wonder when we're upgrading... — Andrew, Jul 20 '17 at 14:11
Seems to be similar to this question: https://stackoverflow.com/questions/40204509/connected-components — dnoeth, Aug 13 '17 at 19:24
@csetzkorn, can you kindly confirm the max depth of cluster? — zarruq, Aug 15 '17 at 07:24
@zarruq sorry I do not understand what you mean. Is this important for the recursion depth of the CTEs? — cs0815, Aug 15 '17 at 11:38
yes, we need to restrict the recursion (just like that standard one) at some point. If you are not sure about it, its okay. — zarruq, Aug 15 '17 at 12:04
@csetzkorn: Maybe create a fiddle and add some more rows to it. — dnoeth, Aug 16 '17 at 13:26
@csetzkorn I think you want hierarchical data. https://learn.microsoft.com/en-us/sql/relational-databases/hierarchical-data-sql-server — LAS, Aug 17 '17 at 00:07
https://stackoverflow.com/a/45650663/905902 (and there will probably be more prior art) — wildplasser, Aug 20 '17 at 14:30

zarruq · Accepted Answer · 2017-08-20T14:16:19.773

Assuming from the sample data provided in question, that there is no loops like A->B->C->A in data, Following is the query that will return the desired output from nodes table.

WITH RECURSIVE NodeCluster (node1,node2,Cluster1) AS
      (SELECT node1,
              node2,
              Rank() Over (
                           ORDER BY node1)
       FROM nodes AS n1
       WHERE NOT EXISTS
           (SELECT *
            FROM nodes AS n2
            WHERE n1.node1 = n2.node2)
       UNION ALL SELECT N1.node1,
                        N1.node2,
                        NodeCluster.Cluster1
       FROM nodes n1,
            NodeCluster
       WHERE NodeCluster.node2=n1.node1 )
    SELECT *
    FROM NodeCluster
    ORDER BY Cluster1,
             node1,
             node2;

In seed query all the starting nodes are selected and ranked in asc to assign Cluster number to data in ascending order.

On the data provided in question, following is the output.

Node1 | Node2  | Cluster1
-------------------------
A       B        1
B       C        1
D       E        2

For re-assurance, more data has been added to the sample data as below.

Node1 | Node2
-------------
A        B   
B        C   
D        E   
E        F   
F        G   
H        I   
I        J   
J        K   
L        M

The query resulted in below output.

Node1 | Node2  | Cluster1
-------------------------
A       B         1
B       C         1
D       E         2
E       F         2
F       G         2
H       I         3
I       J         3
J       K         3
L       M         4

The solution query is successfully tested both with Teradata SQL Assistant and bteq in teradata and ANSI modes.

Hope this will help.

Ali Adlavaran · Answer 2 · 2017-08-18T22:27:23.487

Try this:

    DECLARE @Edges TABLE
    (
      ID INT ,
      Node1 CHAR(1) ,
      Node2 CHAR(1)
    );
INSERT  INTO @Edges
        ( Node1, Node2 )
VALUES  ( 'A', 'B' ),
        ( 'B', 'C' ),
        ( 'D', 'E' );
WITH    CTE_Nodes
          AS ( SELECT   Node1 AS Node
               FROM     @Edges
               UNION
               SELECT   Node2 AS Node
               FROM     @Edges
             ),
        CTE_Pairs
          AS ( SELECT   Node1 ,
                        Node2
               FROM     @Edges
               WHERE    Node1 <> Node2
               UNION
               SELECT   Node2 AS Node1 ,
                        Node1 AS Node2
               FROM     @Edges
               WHERE    Node1 <> Node2
             ),
        CTE_Recursive
          AS ( SELECT   CAST(CTE_Nodes.Node AS VARCHAR(8000)) AS AnchorNode ,
                        Node1 ,
                        Node2 ,
                        CAST(',' + Node1 + ',' + Node2 + ',' AS VARCHAR(8000)) AS NodePath ,
                        1 AS Lvl
               FROM     CTE_Pairs
                        INNER JOIN CTE_Nodes ON CTE_Nodes.Node = CTE_Pairs.Node1
               UNION ALL
               SELECT   CTE_Recursive.AnchorNode ,
                        CTE_Pairs.Node1 ,
                        CTE_Pairs.Node2 ,
                        CAST(CTE_Recursive.NodePath + CTE_Pairs.Node2 + ',' AS VARCHAR(8000)) AS NodePath ,
                        CTE_Recursive.Lvl + 1 AS Lvl
               FROM     CTE_Pairs
                        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
               WHERE    CTE_Recursive.NodePath NOT LIKE CAST('%,'
                        + CTE_Pairs.Node2 + ',%' AS VARCHAR(8000))
             ),
        CTE_RecursionResult
          AS ( SELECT   AnchorNode ,
                        Node1 ,
                        Node2
               FROM     CTE_Recursive
             ),
        CTE_CleanResult
          AS ( SELECT   AnchorNode ,
                        Node1 AS Node
               FROM     CTE_RecursionResult
               UNION
               SELECT   AnchorNode ,
                        Node2 AS Node
               FROM     CTE_RecursionResult
             )
    SELECT  Edges.Node1 ,
            Edges.Node2 ,
            DENSE_RANK() OVER ( ORDER BY CASE WHEN CA_Data.XML_Value IS NULL
                                              THEN Edges.Node1
                                              ELSE CA_Data.XML_Value
                                         END ) AS Cluster
    FROM    @Edges Edges
            CROSS APPLY ( SELECT    CTE_CleanResult.Node + ','
                          FROM      CTE_CleanResult
                          WHERE     CTE_CleanResult.AnchorNode = Edges.Node1
                          ORDER BY  CTE_CleanResult.Node
                        FOR
                          XML PATH('') ,
                              TYPE
                        ) AS CA_XML ( XML_Value )
            CROSS APPLY ( SELECT    CA_XML.XML_Value.value('.',
                                                           'NVARCHAR(MAX)')
                        ) AS CA_Data ( XML_Value );

SQL Fiddle DEMO

This is a way more complicated answer than zarruq's and based on SQL Server, thus doesn't work in Teradata due to `cross apply`. — dnoeth, Aug 20 '17 at 15:06

How to identify groups/clusters in set of arcs/edges in SQL?

2 Answers2

Linked