
I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".

Now the resultant categories must be matched in such a way that related categories are grouped together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, which is obtained by recursing through the rows.

I've posted my current answer, which works on this simple data set, but because the actual data set contains up to 1M rows, and there may be up to 60 categories per "group_id," this query causes an "out of spool space" error on real data.

Please note:

  • Due to company restrictions, I cannot use stored procedures.
  • I can't use user defined functions (UDFs)
  • I can't use user defined types (UDTs)

A correct answer will

  • Be written for Teradata or compatible
  • Be more efficient than my answer
  • Respect the restrictions I mention above

Sample Input:

(image: Sample Input)

Desired Output:

(image: Desired Output)

transistor1
  • Related: [Query to validate if items exist in partition group](http://stackoverflow.com/questions/16900351/query-to-validate-if-items-exist-in-partition-group), [Algorithm for grouping related items](http://stackoverflow.com/questions/9918993/algorithm-for-grouping-related-items) – transistor1 Oct 23 '16 at 23:11

3 Answers


You need a recursive approach, but your WITH RECURSIVE creates a humongous intermediate result, which is what exhausts your spool space.

For a similar process I used the following approach (originally using a WHILE loop in a Stored Procedure):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO LOG AS
 (
  SELECT group_id, category_1, category_2,
     -- assign a unique number to each (group_id, category_1) combination
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk

  FROM match_detail
  -- remove when your source data is already unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

Now repeat the following UPDATE until it processes 0 rows:

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;
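If you can't script this repetition, one option (see the comments below) is to wrap the UPDATE in a macro and re-issue it a fixed number of times with BTEQ-style conditional logic, which SQL Assistant also accepts. A rough sketch only: the macro name update_rnk and the label finished are just placeholders.

CREATE MACRO update_rnk AS
 ( -- vt_tmp must already exist in this session when the macro is created
   UPDATE vt_tmp FROM
    ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
      FROM vt_tmp e1 JOIN vt_tmp e2
      ON e1.category_2 = e2.category_2
      AND e1.rnk < e2.rnk
      GROUP BY e2.group_id, e2.category_1
    ) x
   SET rnk = minrnk
   WHERE vt_tmp.group_id = x.group_id
     AND vt_tmp.category_1 = x.category_1;
 );

-- repeat the EXEC/.IF pair as often as the worst case needs (e.g. 60 times);
-- .GOTO only jumps forward, so this is an unrolled repetition, not a loop
EXEC update_rnk;
.IF ACTIVITYCOUNT = 0 THEN .GOTO finished
EXEC update_rnk;
.IF ACTIVITYCOUNT = 0 THEN .GOTO finished
-- ... more EXEC/.IF pairs ...
.LABEL finished

-- final SELECT (see below) goes here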

To get the related categories you finally need:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

For an exact match of your expected result you need to add a DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt
dnoeth
  • Thank you. +1 for your answer. I will try this when I get in front of a computer. One thing - is there any way to iterate the update statement automatically (in SQLA) e.g. without repeating the statement 60 times? I had thought to use an UPDATE using a cross-join, but unfortunately TD (14.00.03 I believe) required that UPDATEs be 1:1 – transistor1 Oct 23 '16 at 17:17
  • 1
    @transistor1: You probably don't need 60 recursions, but SQLA doesn't supports loops. You might put the update in a macro and then call it instead, e.g. `exec macro; .IF ACTIVITYCOUNT = 0 THEN GOTO .finish; ... repeat IF/EXEC n-times; LABEL .finish; final SELECT`. – dnoeth Oct 23 '16 at 17:29
  • This may work. For whatever reason, I believe I *can* create macros in my workdb. I wasn't aware they supported iteration. – transistor1 Oct 23 '16 at 17:36
  • @transistor1: Macros don't support iteration, they just simplify reusing the same source code multiple times. Btw, you can always create macros within your own user, DBAs don't restrict that. – dnoeth Oct 23 '16 at 17:40
  • That `.IF` syntax is BTEQ syntax, right? Does that work in SQLA? I ask because the person who will be running this query (weekly) will be using SQLA. – transistor1 Oct 23 '16 at 17:45
  • 1
    @transistor1: Yep, it's BTEQ syntax, but also supported by SQLA. Maybe you can convince your DBA, that this is perfect task for a Stored Procedure. – dnoeth Oct 23 '16 at 17:53
  • Thanks! At the risk of getting into an extended conversation, let's just say I've already been down that road and there is too much red tape & corporate politics surrounding that option at this time... – transistor1 Oct 23 '16 at 17:57
  • "For to the one who has, more will be given, and he will have an abundance, but from the one who has not, even what he has will be taken away." [Matt 13:12](https://www.bible.com/bible/59/mat.13#12). Thanks for the help! – transistor1 Oct 27 '16 at 21:08

The concept of this solution is to avoid cycles while traversing the edges.
This is done by computing the path's bitmap on the fly and not adding an edge whose category_2 is already in the bitmap.
Every path starts with a self-referencing edge (e.g. 'A'--'A') in order to seed the bitmap with the first node.
This lowers the number of result records of the recursive sub-query (edges) to 70.


The limitation of this solution is that the bitmap size must be defined in advance.
Every hex digit in the byte (VARBYTE) literal stands for 4 bits:
64 bits  = '0000000000000000'xb
128 bits = '00000000000000000000000000000000'xb
etc.
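
A quick way to see how the setbit/getbit calls used below behave on such a literal (illustrative only):

SELECT setbit ('0000000000000000'xb, 3)             AS bm     -- 64-bit bitmap with bit 3 set
      ,getbit (setbit ('0000000000000000'xb, 3), 3) AS bit_3  -- returns 1
      ,getbit (setbit ('0000000000000000'xb, 3), 4) AS bit_4  -- returns 0
;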


with        category_group_bitmap (group_id,category_1,bitflag_group)
            as 
            (
                select      group_id
                           ,category_1

                           ,dense_rank () over 
                            (
                                partition by    group_id 
                                order by        sum (distinct category_2_n)
                            ) as bitflag_group

                from        edges

                group by    group_id
                           ,category_1
            )  

           ,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
            as
            (
                select      1                                           as n
                           ,m.group_id
                           ,m.category_1
                           ,m.category_2
                           ,gc.categories_num
                           ,setbit ('0000000000000000'xb,m.category_2_n)  as bitmap
                           ,m.category_2_n

                from                    match_detail_category_2_n   as m

                            join        group_categories_num        as gc

                            on          gc.group_id =
                                        m.group_id

                where       m.category_1    =
                            m.category_2

                union all

                select      e.n + 1                             as n
                           ,e.group_id
                           ,e.category_1
                           ,m.category_2
                           ,e.categories_num
                           ,setbit (e.bitmap,m.category_2_n)    as bitmap
                           ,m.category_2_n

                from                    edges                       as e

                            join        match_detail_category_2_n   as m

                            on          m.group_id  =
                                        e.group_id

                                    and m.category_1    =
                                        e.category_2

                where       e.n <   e.categories_num - 1
                        and getbit (e.bitmap,m.category_2_n) = 0
            )

           ,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
            as 
            (   
                select      m.group_id
                           ,category_1
                           ,category_2

                           ,cast 
                            (
                                dense_rank () over 
                                (
                                    partition by    group_id 
                                    order by        category_2
                                ) - 1 

                                as byteint
                            )

                from        match_detail as m
            )

           ,group_categories_num (group_id,categories_num)
            as 
            (   
                select      group_id
                           ,count (distinct category_1)

                from        match_detail

                group by    group_id
            )

select      *

from        category_group_bitmap
;
David דודו Markovitz
  • @transistor1, just wanted to verify you've seen this solution. – David דודו Markovitz Oct 28 '16 at 10:24
  • @transistor1, when would we know if it survives production? – David דודו Markovitz Oct 28 '16 at 12:19
  • If I can confirm that this works more efficiently than @dnoeth's solution, I'll change the accepted answer. I'm not going to provide any timeline. The bounty was created specifically for him, though, so that won't change-- and of course it's already awarded, so I can't change it if I wanted to! Whenever I'm searching for a Teradata solution, I almost always land on one of his answers; he really came through for me here as well, and I appreciate that. That being said, your solution is really cool and it looks very promising. – transistor1 Oct 28 '16 at 12:52
  • When I started working on the Teradata database, @dnoeth was the one and only saviour, so no need to convince me here :-) Also, dnoeth's solution is not based on a bitmap and therefore is not limited by bitmap size, so in any case it should be marked as the correct solution. I'm simply curious. P.S. Earlier I posted here a solution based on stored procedures that I'm using in our production env. The logic is similar to dnoeth's solution and it's working on hundreds of millions of records, executing in 1-1.5 minutes. – David דודו Markovitz Oct 28 '16 at 13:15
  • thanks for your understanding :-). I don't know if there's such a thing as a "Teradata MVP," but if there is, he should get it. I'm testing your answer now. Yes, I saw that answer, but in my current situation it's nearly impossible to get SP access where I am working now. I am in a very large company and there is lots of red tape & politics unfortunately. – transistor1 Oct 28 '16 at 13:22
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/126910/discussion-between-dudu-markovitz-and-transistor1). – David דודו Markovitz Oct 28 '16 at 13:23
  • @DuduMarkovitz: No need for multiple bitmaps, you can use a VARBYTE of any (?) size instead: `select setbit ('000000000000000000000000000000000000000000000000'xb,64);` Would be interesting to compare performance of the WHILE-loop and this recursion. – dnoeth Nov 05 '16 at 18:43
  • @dnoeth - I've overlooked this in the documentation - **this is so cool** ! P.s. TD 14.10/15.10 doc.: "The maximum supported size (n) for VARBYTE is 8192 bytes." – David דודו Markovitz Nov 05 '16 at 19:08

This works, but causes an "out of spool space" issue on real data.

Schema creation:

CREATE VOLATILE TABLE match_detail (
    group_id bigint
    , category_1 varchar(255)
    , category_2 varchar(255)
) PRIMARY INDEX (group_id)
    ON COMMIT PRESERVE ROWS;

INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');

Query:

WITH

 related_cats AS (
    SELECT
        group_id
        , category_1
        , SUM(DISTINCT bitflag) As bitflag_total
        , DENSE_RANK() OVER (
            PARTITION BY
                group_id
            ORDER BY
                group_id
                , bitflag_total
        ) As bitflag_group

    FROM bitflags

    GROUP BY 1, 2
)

, bitflags As (
    SELECT
        DISTINCT
            group_id
            , category_1
            , category_2
            , CAST
            (
                2 ** (DENSE_RANK() OVER (
                    PARTITION BY
                        group_id

                    ORDER BY group_id
                        , category_2) - 1)

                As bigint
            ) As bitflag

    FROM cat_join
    WHERE depth = 1
)

, RECURSIVE cat_join AS (
    SELECT DISTINCT
        c1.group_id
        , c1.category_1
        , c1.category_2
        , CAST
        (
            n.num_categories - 1
            As integer
        ) As max_depth
        , CASE
            WHEN c1.category_1 = c1.category_2 THEN 1
            ELSE max_depth
        END As depth
        , 1 As recursion

    FROM matches c1
    INNER JOIN num_categories n
        ON c1.group_id = n.group_id

    UNION ALL

    SELECT
        r1.group_id
        , r1.category_1
        , r2.category_2
        , r1.max_depth
        , CASE
            WHEN r1.category_1 = r1.category_2 THEN 1
            WHEN r1.category_1 = r2.category_2 THEN 1
            ELSE r1.depth - 1
        END As cur_depth
        , recursion + 1

        FROM cat_join r1
        INNER JOIN matches r2
            ON r1.group_id = r2.group_id
            AND r1.category_2 = r2.category_1

        WHERE
        r1.category_1 <> r2.category_2
        AND r1.category_1 <> r2.category_1
        AND
            (r1.depth - 1) > 0
)

, matches AS (
    SELECT
        d.group_id
        , d.category_1
        , d.category_2

    FROM match_detail d
    GROUP BY d.group_id, d.category_1, d.category_2
)

, num_categories AS (
    SELECT
        u.group_id
        , COUNT(DISTINCT u.category_2) AS num_categories

    FROM categories u
    GROUP BY u.group_id
)

, categories AS (
    SELECT DISTINCT
        u.group_id
        , u.category_1
        , u.category_2

    FROM match_detail u
)

SELECT *
FROM related_cats
transistor1