2

Suppose I have a hierarchical structure like this.

Father ID Child ID
1 2
1 3
1 7
3 4
3 5
3 6

In my case, those are production orders, that demand from (father) or supply to (child) each other. I need to bring those orders to a structure like this:

Master Father Hierarchy Child
1 001.001 2
1 001.002 3
1 001.002.001 4
1 001.002.002 5
1 001.002.003 6
1 001.003 7

Afterwards I have to join the components to the child orders, that were used to create the master father. The target is, to trace the origin of the components, that were used to create a certain product.

The big problem is, that I can't predict, how deep those tree structures will become in the future. But I want to create a report, that won't break, when they get deeper.

So far I counted the depth manually. If the Master Father is 0 then I have 2 Levels in my example case. The (pseudo) code for the transformation would look like this:

Select 
    L.Father_ID as Level0
    , '001' as Hierarchy0
    , L.Child_ID as Level1
    , Dense_Rank() over (partition by L.Father_ID) as Hierarchy1
    , R1.Child_ID as Level2
    , Dense_Rank() over (partition by L.Father_ID, L.Child_ID) as Hierarchy2
Into #exploded_table
From Table as L
Left Join Table as R1
on L.Child_ID  = R1.Father_ID 

Select distinct
    Level0 as Master_Father
    , Concat(Hierarchy0,'.',format(Hirarchy1, '000')) as Hierarchy
    , Level1 as Child
From #exploded_table
Union all
Select distinct
    Level0 as Master_Father
    , Concat(Hierarchy0,'.',format(Hirarchy1, '000'),'.',format(Hirarchy2, '000')) as Hierarchy
    , Level2 as Child
From #exploded_table

I have two problems with this code.

  1. It gets longer with every level
  2. It will break if more levels will be added in the future

Therefore I started to write dynamic code. It first counts the depth of the deepest tree and then it dynamically creates the code with as much levels as are needed.

When you execute this code the amount of columns is (as far as I understand it) called "Non-Deterministic". And MS SQL hates "Non-Deterministic" things. To be allowed to store this as a temp table I have to create a temp table outside of the scope of the EXEC sp_executesql function. And then I have to dynamically modify the columns of this table to fit exactly to the result of the dynamic SQL statement that I pass to the Exec function. I won't use global temp tables since this will lead to chaos when multiple reports have the same names for their temp tables.

The above described method is complicated and unreadable! I can do it, but it feels like the opposite of good practice.

So I want to ask the community whether this can't be done in a simpler way. I came from a company that used PySpark where this was easily doable and now I am here and all I have is SQL Server and the SQL Server Reporting Services.

Dale K
  • 25,246
  • 15
  • 42
  • 71

2 Answers2

1

I believe you are mostly along the right track. However, as you have defined columns (e.g., Master_Father_ID, Hierarchy, and Child_ID) you don't need dynamic SQL.

Rather than a recursive CTE, I've written the below as a WHILE loop to help understand what's happening. It's easy enough to convert to a recusive CTE but I think (especially to many people familiar with other programming languages) that a WHILE loop is more familiar.

The logic of the below is as follows

  1. Find any 'Master fathers' and insert them into the working table
  2. For each value inserted in the last step, find their direct children and insert them into the working table
  3. Repeat step 2 above (e.g., round 3 finds children of children; round 4 finds children of children of children) until no more data is inserted

Then the final report needs to simply ignore the initial Master_Father rows inserted (e.g., what would be 001, 002 etc in your table above).

The SQL below can also be found in this db<>fiddle - noting that I have added extra data for demonstration purposes.

Initial data

CREATE TABLE #exploded_table (
        Father_ID int, 
        Child_ID int, 
        PRIMARY KEY (Father_ID, Child_ID)
        );

INSERT INTO #exploded_table (Father_ID, Child_ID) VALUES
(1,  2),  (1,  3),  (1,  7),  (3,  4),  (3,  5),  (3,  6);

Now for the working table - I've called it #summary_table. Current_ID represents the current Child_ID for that row (e.g., at the deepest level of the hierarchy). insert_round is the round number (loop iteration) that the row was inserted.

I also have two variables: @round that indicates which round/cycle we're up to, and @n indicating the number of rows inserted in the current round.

CREATE TABLE #summary_table (
            Master_Father_ID int, 
            Current_ID int, 
            Hierarchy nvarchar(500), 
            insert_round int
            );

DECLARE @round int = 1;
DECLARE @n int = 0;

Step 1: Insert master fathers

Master fathers are those that are not themselves children e.g., Father_Id 1 is not a child, so it is included; Father_ID 3 is a child, therefore isn't a master father.

Note that I'm using RIGHT(N'000' + LTRIM(STR(int value)), 3) to convert any integer number to 3-digit format as a string.

INSERT INTO #summary_table (Master_Father_ID, Current_ID, Hierarchy, insert_round)
    SELECT  DISTINCT Father_ID, 
            Father_ID, 
            RIGHT(N'000' + LTRIM(STR(DENSE_RANK() OVER (ORDER BY Father_ID))), 3), 
            @round
    FROM    #exploded_table
    WHERE   Father_ID NOT IN (SELECT Child_ID FROM #exploded_table);

SET @n = @@ROWCOUNT;

Step 2: Find any children of the last round and insert their data into the working table

(Also Step 3: Repeat until no more is happening)

Taking the data from the last round of insertions (identified by #summary_table.insert_round), identify any children and insert their data into the working table.

Keep doing this until you don't insert any more rows.

WHILE @n > 0
    BEGIN

    INSERT INTO #summary_table (Master_Father_ID, Current_ID, Hierarchy, insert_round)
        SELECT  DISTINCT
                #summary_table.Master_Father_ID, 
                #exploded_table.Child_ID, 
                #summary_table.Hierarchy + N'.' + RIGHT(N'000' + LTRIM(STR(DENSE_RANK() OVER (PARTITION BY #summary_table.Master_Father_ID, #summary_table.Current_ID ORDER BY #exploded_table.Child_ID))), 3),
                @round + 1
        FROM    #summary_table
                INNER JOIN #exploded_table ON #summary_table.Current_ID = #exploded_table.Father_ID
        WHERE   #summary_table.insert_round = @round;

    SET @n = @@ROWCOUNT;

    SET @round += 1;
    
    END;

Then the final report

SELECT      Master_Father_ID, Hierarchy, Current_ID AS Child_ID
    FROM    #summary_table 
    WHERE   insert_round > 1
    ORDER BY Hierarchy;

Data is as follows

Master_Father_ID    Hierarchy       Child_ID
1                   001.001         2
1                   001.002         3
1                   001.002.001     4
1                   001.002.002     5
1                   001.002.003     6
1                   001.003         7

Note - this assumes your original data is set up properly and has no loops/etc (e.g., if Kyle Reese was John Connor's child as well as his father... err sorry, 'spolier alert'). If you have those, you need to add extra checks - and those checks will depend on how you actually want to deal with those loops.

seanb
  • 6,272
  • 2
  • 4
  • 22
  • This is genius. Thank you. I am very happy about the fact, that this method completely avoids making SQL strings. From my understanding of the business application, there shouldn't be loops. Yet there are some. But I think I will find a way to handle them ;) – Merlin Nestler Dec 26 '22 at 02:46
  • 1
    Thanks. There are times that are very hard to avoid loops - particularly where the results of some rows depend on changes to other rows made in the current command. Note that I did this as a WHILE loop because I think they are the easiest to understand; a recursive CTE (which is still a loop) would probably be marginally faster if you want to convert this to that format. Alternatively, if you have a lot of data, you could also improve processing a little but adding a CLUSTERED INDEX on #summary_table, just on `insert_round` - this will make identifying the previously inserted rows easier. – seanb Dec 26 '22 at 04:59
  • When using your code I was thinking about this line: `WHERE Father_ID NOT IN (SELECT Child_ID FROM #exploded_table);` For every line in "#exploded_table" you make one subquery searching for hits. Wouldn't it be more efficient to first join #exploded table with itself (Father on Child) and then make a condition `where Child (from joined table) = NULL` I expect this to have less overhead. What do you think? – Merlin Nestler Jan 20 '23 at 01:19
  • 1
    Hi @MerlinNestler, good timing! Let's call the two versions `NOT IN` and `Child IS NULL`. I think it's incorrect to say that the `NOT IN` version reads the table once per father; instead, it does a LEFT OUTER SEMI join which may be nested loops or as the whole data set at once. The `Child IS NULL` version does a LEFT OUTER JOIN, and can also be done as nested loops or the whole data set at once. If the father and child relation is at most 1-to-1 then they will perform the same; and the `NOT IN` version will perhaps perform infinitesimally better if it's many-to-1. – seanb Jan 20 '23 at 01:30
  • 1
    Also, at a higher level, SQL Server does always do what you tell it. It is `declarative` in that you tell it the answer you want, and it works out the way to best get the data to do so. For example, if you JOIN to a view, it doesn't necessarily run that view independently. Instead, it gets the parts of the view it needs and then incorporates it into its current plan. So, often, I find the best versions are the easiest-to-read versions because they'll often get exactly the same plan (or close to the same plan) anyway. – seanb Jan 20 '23 at 01:33
  • 1
    And if you need proof that SQL Server doesn't do what you tell it to do... try these commands (but replace myTab for a table you have). Which two of these commands have the same execution plan? `IF EXISTS (SELECT * FROM myTab) SELECT 1; IF (SELECT COUNT(*) FROM myTab) = 0 SELECT 1; IF (SELECT COUNT(*) FROM myTab) = 1 SELECT 1;` Hint: It's not the second and third! – seanb Jan 20 '23 at 01:38
  • As a takeaway from your comments I make the following mental note: When I build a lot of casts in my code, the code will be slow because I ordered SQL explicitly to do this. But when it comes to the question of how to join data, SQL will build an execution plan in a way, that all possible ways of writing the actual code will have similar runtimes. That is one of the values of MS SQL. Therefore often it is the correct move to write readable code instead of code that feels more efficient. Thank you for your explanation. – Merlin Nestler Jan 20 '23 at 05:36
1

You should be able to do this with Recursive CTE.

I have also included an example of a build in SQL Server hierarchy data type (HIERARCHYID).

See comments in code for details

-- Test data based on your example
CREATE TABLE #temp( FatherID INT, ChildID INT )
INSERT INTO #temp
VALUES ( 1, 2 ), ( 1, 3 ), ( 1, 7 ), ( 3, 4 ), ( 3, 5 ), ( 3, 6 ),
-- I have added a 3rd level
( 6, 8 ), ( 6, 9 )
-- SELECT * FROM #temp;

-- Recursive CTE to get Children of Children and construct Hierarchy
;WITH Hierarchies( MasterFatherID, FatherID, ChildID, Hierarchy, Hierarchy2 )
AS(
    -- This is the "anchor" part
    SELECT FatherID AS MasterFatherID, FatherID, ChildID,
        FORMAT( DENSE_RANK() OVER( PARTITION BY FatherID ORDER BY ChildID ), '000' ) AS Hierarchy,
        -- This is an example of SQL Server built in Hierarchy data type
        CAST( '/' + FORMAT( DENSE_RANK() OVER( PARTITION BY FatherID ORDER BY ChildID ), '0' ) + '/' AS HIERARCHYID ) AS Hierarchy2
    FROM #temp
    UNION ALL
    SELECT
        -- Top level Parent
        Hierarchies.MasterFatherID,
        -- Current Parent
        t.FatherID,
        -- Current Child
        t.ChildID,
        Hierarchies.Hierarchy + '.' + FORMAT( DENSE_RANK() OVER( PARTITION BY t.FatherID ORDER BY t.ChildID ), '000' ) AS Hierarchy,
        -- This is an example of SQL Server built in Hierarchy data type
        HIERARCHYID::Parse( Hierarchies.Hierarchy2.ToString() + FORMAT( DENSE_RANK() OVER( PARTITION BY t.FatherID ORDER BY t.ChildID ), '0' ) + '/' ) AS Hierarchy2
    FROM Hierarchies
        INNER JOIN #temp AS t ON Hierarchies.ChildID = t.FatherID
)
SELECT MasterFatherID, FatherID, ChildID, Hierarchy, Hierarchy2.ToString() AS Hierarchy2
FROM Hierarchies AS a
-- This will exclude Children (e.g. [3, 4]) from being included as their own chain
WHERE NOT EXISTS( SELECT * FROM Hierarchies AS Children WHERE Children.ChildID = a.MasterFatherID )
ORDER BY MasterFatherID

Results

MasterFatherID FatherID    ChildID     Hierarchy      Hierarchy2
-------------- ----------- ----------- -------------- ----------
1              1           2           001            /1/
1              1           3           002            /2/
1              1           7           003            /3/
1              3           4           002.001        /2/1/
1              3           5           002.002        /2/2/
1              3           6           002.003        /2/3/
1              6           8           002.003.001    /2/3/1/
1              6           9           002.003.002    /2/3/2/
Alex
  • 4,885
  • 3
  • 19
  • 39
  • Thank you for your reply Alex. With recursive SQL I made the experience that the code tends to take way more time than using another method. The server that I work with is rather weak so to assure quick response times for reports I am very dependent on solutions with quick runtimes. I didn't test the speed of your method yet, but this is my reason for accepting the anser from Sean. – Merlin Nestler Dec 26 '22 at 02:58