Synapse top level parent hierarchy

Question

Heres the table structure;

code    name                      under 

1       National Sales Manager    1
2       regional sales manager    1 
3       area sales manager        2 
4       sales manager             3

How do I get the top level parent hierarchy like below;

code    name                      under     ultimateparent

1       National Sales Manager    1         1
2       regional sales manager    1         1
3       area sales manager        2         1
4       sales manager             3         1

On a regular sql server I would use a recursive CTE as shown in SQL Server function to get top level parent in hierarchy. However it is not supported in synapse db's.

Your options are to use some sort of looping mechanism or if you know the maximum depth, you can construct a query to do what you want. — Gordon Linoff, Apr 09 '21 at 12:34

score 1 · Answer 1 · answered Apr 12 '21 at 20:49

You have a few options. I will describe three here:

if you have a nearby Azure SQL DB and the volumes aren't too large, present the table in that Azure SQL DB using CREATE EXTERNAL TABLE or simply use Azure Data Factory (ADF) to copy the data over, perform your recursive CTE then use ADF to port it back over. Alternately use some kind of pre-processing before this data gets into your SQL pool.
Recursive CTE is just a type of loop at the end of the day so WHILE is supported on Synapse. Now obviously this type of loop does not translate well to Synapse as it’s chatty but may be an option for small volumes with low depths of hierarchy. It's up to you to examine the trade-off in using the MPP architecture ineffectively in this way versus coding up an alternative.

I've coded up an example of option 2 which took over 20 seconds to run for just a few rows. Normally I would regard that as unacceptable but as mentioned it's up to you to weigh up the alternatives:

IF OBJECT_ID('dbo.someHierarchy') IS NOT NULL
    DROP TABLE dbo.someHierarchy;

CREATE TABLE dbo.someHierarchy (
    code        INT NOT NULL,
    [name]      VARCHAR(50) NOT NULL,
    under       INT NOT NULL
)
WITH
    (
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
    );


INSERT INTO dbo.someHierarchy ( code, [name], under )
SELECT 1, 'National Sales Manager', 1
UNION ALL
SELECT 2, 'Regional Sales Manager', 1
UNION ALL
SELECT 3, 'Area Sales Manager', 2
UNION ALL
SELECT 4, 'Sales Manager', 3

INSERT INTO dbo.someHierarchy ( code, [name], under )
SELECT 5, 'Lead Bob', 5
UNION ALL
SELECT 6, 'Main Bob', 5
UNION ALL
SELECT 7, 'Junior Bob 1', 6
UNION ALL
SELECT 8, 'Junior Bob 2', 6

INSERT INTO dbo.someHierarchy ( code, [name], under )
SELECT 9, 'Jim - CEO', 9
UNION ALL
SELECT 10, 'Tim - CFO', 9
UNION ALL
SELECT 11, 'Rob - CIO', 9
UNION ALL
SELECT 12, 'Bob - VP', 10
UNION ALL
SELECT 13, 'Shon - Director', 12
UNION ALL
SELECT 14, 'Shane - VP', 11
UNION ALL
SELECT 15, 'Sheryl - VP', 11
UNION ALL
SELECT 16, 'Dan - Director', 15
UNION ALL
SELECT 17, 'Kim - Director', 15
UNION ALL
SELECT 18, 'Carlo - PM', 16
UNION ALL
SELECT 19, 'Monty - Sr Dev', 18
UNION ALL
SELECT 20, 'Chris - Sr Dev', 18




IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp;

CREATE TABLE #tmp (
    xlevel          INT NOT NULL,
    code            INT NOT NULL,
    [name]          VARCHAR(50) NOT NULL,
    under           INT NOT NULL,
    ultimateParent  INT NOT NULL
    );
    

-- Insert first level; similar to anchor section of CTE
INSERT INTO #tmp ( xlevel, code, [name], under, ultimateParent )
SELECT 1 AS xlevel, code, [name], under, under AS ultimateParent
FROM dbo.someHierarchy
WHERE under = code;


-- Loop section
DECLARE @i INT = 1

WHILE EXISTS (
    SELECT * FROM dbo.someHierarchy h
    WHERE NOT EXISTS ( SELECT * FROM #tmp t WHERE h.code = t.code )
    )
BEGIN

    -- Insert subsequent levels; similar to recursive section of CTE
    INSERT INTO #tmp ( xlevel, code, [name], under, ultimateParent )
    SELECT t.xlevel + 1, h.code, h.[name], h.under, t.ultimateParent
    FROM #tmp t
        INNER JOIN dbo.someHierarchy h ON t.code = h.under
    WHERE h.under != h.code
      AND t.xlevel = @i;

    -- Increment counter
    SET @i += 1

    -- Loop guard
    IF @i > 99
    BEGIN
        RAISERROR( 'Too many loops!', 16, 1 )
        BREAK
    END
END

SELECT 'loop' s, *
FROM #tmp
ORDER BY code, xlevel;

Results:

The condition is the WHILE EXISTS loop is a particularly expensive way to do it so maybe there is a simpler way with your data.

A third alternative is to use an Azure Synapse Notebook and library like GraphFrames to walk the hierarchy. There way be simpler ways to do it but I found the Connected Components method was able to determine the ultimate manager. One advantage of using GraphFrames is that it would allow more complex graph queries eg using motifs if required. This notebook is using the Spark (Scala) version:

Upload the right version of the graphFrames library to Spark:

%%configure -f
{
    "conf": {
        "spark.jars": "abfss://{yourContainer}@{yourDataLake}.dfs.core.windows.net/synapse/workspaces/{yourWorkspace}/sparkpools/{yourSparkpool}/libraries/graphframes-0.8.1-spark2.4-s_2.11.jar",
    }
}

Configure the elements with curly brackets for your environment.

Import the relevant libraries:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._

Get the data from the dedicated SQL pool and assign it to a dataframe:

// Get a table from Synapse dedicated SQL pool, select / rename certain columns from it to vertices and edge dataframes
val df = spark.read.synapsesql("yourDB.dbo.someHierarchy")
val v = df.selectExpr("code AS id", "name AS empName", "under")

v.show

// Reformat the code/under relationship from the original table 
// NB Exclude because in graph terms these don't have an edge
val e = df.selectExpr("code AS src", "under AS dst", "'under' AS relationship").where("code != under")

e.show

Create the graphframe from the vertices and edges dataframes:

// Create the graph frame
val g = GraphFrame(v, e)
print(g)

Set a checkpoint for connectedComponents:

// The connected components adds a component id to each 'group'
// Set a checkpoint to start
sc.setCheckpointDir("/tmp/graphframes-azure-synapse-notebook")

Run connected components algorithm against the data:

// Run connected components algorithm against the data
val cc = g.connectedComponents.run() // doesn't work on Spark 1.4
display(cc)

Join the original vertices dataframe and the result of the connected components algorithm and write it back to Azure Synapse dedicated SQL pool:

val writeDf = spark.sqlContext.sql ("select v.id, v.empName, v.under, cc.component AS ultimateManager from v inner join cc on v.id = cc.id")

//display(writeDf)
writeDf.write.synapsesql("someDb.dbo.someHierarchy2", Constants.INTERNAL)

Results:

I have a feeling there would be a simpler way to accomplish this with notebooks but look forward to seeing some alternatives. Upvote the feedback item for recursive CTEs on Synapse here:

https://feedback.azure.com/forums/307516-azure-synapse-analytics/suggestions/14876727-support-for-recursive-cte

score 0 · Answer 2 · answered Apr 14 '21 at 23:16

0

Have you considered or tried landing the data in a json file and using a Synapse dataflow to flatten the hierarchy for you?

answered Apr 14 '21 at 23:16

Jordan J

31
1

Synapse top level parent hierarchy

2 Answers2

Linked