SQL Server skip duplicate relations (parent child) in recursion

Question

I have a tree, where specific node in tree can appear in another node in tree. (2 in my example):

                1
            /       \
        2               3
     /    \                 \
    4       5                   6
                                  \
                                    2
                                  /   \
                                 4     5

Notice 2 is duplicated. First under 1, and second under 6. My recursion is:

with cte (ParentId, ChildId, Field1, Field2) AS (
    select BOM.ParentId, BOM.ChildId, BOM.Field1, BOM.Field2
    from BillOfMaterials BOM 
    WHERE ParentId=x

    UNION ALL

    SELECT BOM.ParentId, BOM.ChildId, BOM.Field1, BOM.Field2 FROM BillOfMaterials BOM 
    JOIN cte on BOM.ParentId = cte.ChildId
)
select * from cte;

But the problem is that in result relation 2-4 and 2-5 is duplicated (first from relation 1-2 and second from relation 6-2):

ParentId    ChildId                 OtherFields
    1           2
    1           3
    2           4  /*from 1-2*/
    2           5  /*from 1-2*/
    3           6
    6           2
    2           4  /*from 6-2*/
    2           5  /*from 6-2*/

Is there any way, to skip visiting duplicated relationships? I do no see any logic why should recursion run over rows that are already in result. It would be faster. Something like that:

        with cte (ParentId, ChildId, Field1, Field2) AS (
            select BOM.ParentId, BOM.ChildId, BOM.Field1, BOM.Field2
            from BillOfMaterials BOM 
            WHERE ParentId=x

            UNION ALL

            SELECT BOM.ParentId, BOM.ChildId, BOM.Field1, BOM.Field2 FROM BillOfMaterials BOM 
            JOIN cte on BOM.ParentId = cte.ChildId
------>     WHERE (select count(*) FROM SoFarCollectedResult WHERE ParentId=BOM.ParentId AND ChildId=BOM.ChildId ) = 0
        )
        select * from cte;

I found this thread, but it is 8 years old.
I am using SQL server 2016.

If this is not possible, then my question is how can I remove duplicates from final result, but check distinct only on ParentId and ChildId columns?

Edited:

Expected result is:

ParentId    ChildId                 OtherFields
    1           2
    1           3
    2           4
    2           5
    3           6
    6           2

Your data is flawed. You can't put that together like that at all. How do you know that one set of 2,4 belongs under 1 and the other belongs under 6. If you need that kind of relationship you need to rebuild your data structure because as a standard parent child it isn't going to work. — Sean Lange, Jul 28 '17 at 18:41
There's a fundamental problem here. A tree can be traversed according to different methodologies. There's no real sense of 'first' -- who is to say which occurrence of a node deserves the children. Nodes cannot 'instanced' and then distinguished by their ancestry, that makes them not be nodes anymore. — Ryan B., Jul 28 '17 at 18:54
My data is OK for sure. In table I have record 1(parent) 2(child) 1(position) and 6(parent) 2 (child) 2(position). All this three fields gets unique position in tree. But there is no reason why I should recursively get all (maybe 10 levels or more) twice (in upper case) for node 2. — Makla, Jul 28 '17 at 21:33

LukStorms · Answer 1 · 2017-07-30T09:12:47.423

You can, with adding to 2 little tricks to the SQL.

But you need an extra Id column with a sequential number.
For example via an identity, or a datetime field that shows when the record was inserted.
For the simple reason that as far the database is concerned, there is no order in the records as they were inserted, unless you got a column that indicates that order.

Trick 1) Join the CTE record only to Id's that are higher. Because if they were lower then those are the duplicates you don't want to join.

Trick 2) Use the window function Row_number to get only those that are nearest to the Id the recursion started from

Example:

declare @BillOfMaterials table (Id int identity(1,1) primary key, ParentId int, ChildId int, Field1 varchar(8), Field2 varchar(8));

insert into @BillOfMaterials (ParentId, ChildId, Field1, Field2) values
(1,2,'A','1-2'),
(1,3,'B','1-3'),
(2,4,'C','2-4'),  -- from 1-2
(2,5,'D','2-5'),  -- from 1-2
(3,6,'E','3-6'),
(6,2,'F','6-2'),
(2,4,'G','2-4'),  -- from 6-2
(2,5,'H','2-5');  -- from 6-2

;with cte AS 
(
    select Id as BaseId, 0 as Level, BOM.*
    from @BillOfMaterials BOM 
    WHERE ParentId in (1)

    UNION ALL

    SELECT CTE.BaseId, CTE.Level + 1, BOM.*
    FROM cte 
    JOIN @BillOfMaterials BOM on (BOM.ParentId = cte.ChildId and BOM.Id > CTE.Id)
)
select ParentId, ChildId, Field1, Field2
from (
    select *
    --, row_number() over (partition by BaseId, ParentId, ChildId order by Id) as RNbase
    , row_number() over (partition by ParentId, ChildId order by Id) as RN
    from cte 
) q
where RN = 1
order by ParentId, ChildId;

Result:

ParentId ChildId Field1 Field2
-------- ------- ------ ------
1        2       A      1-2
1        3       B      1-3
2        4       C      2-4
2        5       D      2-5
3        6       E      3-6
6        2       F      6-2

Anyway, as a sidenote, normally a Parent-Child relation table is used differently.
More often it's just a table with unique Parent-Child combinations that are foreign keys to another table where that Id is a primary key. So that the other fields are kept in that other table.

But in this case 2 - 4 and 2 - 5 are also duplicate. If there are, for example, 10 levels beneath node 2, recursion must go all 10 levels down twice. And I want to avoid that if possible. — Makla, Jul 28 '17 at 21:20
@Makla Could you then add an expected result to your question? As you now express it, there would only be the 2-4 in the result, and no 2-5 anywhere. — LukStorms, Jul 28 '17 at 22:35

yanman1234 · Answer 2 · 2017-07-28T19:01:56.500

Change your last query from:

select * from cte;

To:

select * from cte group by ParentId, ChildId;

This will essentially take what you have right now, but go one step further and remove rows that have already appeared, which would take care of your duplicate problem. Just be sure that all * returns here is ParentId and ChildId, should it be returning other columns you will need to either add them to the GROUP BY or apply some sort of aggregator to it so that it can still group (max, min, count...).

Should you have more rows that you can't aggregate or group on, you could write the query as such:

select * from cte where ID in (select MAX(ID) from cte group by ParentId, ChildId);

Where ID would be your primary table id for cte. This would take the maximum id when rows matched, which would normally be your latest entry, if you want the earliest entry just change MAX() to MIN().

SQL Server skip duplicate relations (parent child) in recursion

2 Answers2