3

I have a query that creates a sorted dictionary (sorted in that there is an incremental id to identify relative position of the keys).

I then wish to know, for each row, if the value exists as a key in any other row later on in the dictionary. I'm doing that with a correlated query in a CROSS APPLY. Effectively a self join on a CTE.

As far as I understand it, this means that the CTE that represents the Dictionary will have to be computed twice?

Other than using table variables (it's inside a function), does anyone have any alternative suggestions?

WITH
  dictionary([id], [key], [val]) AS
(
            SELECT 1, 'a', 'b' 
  UNION ALL SELECT 2, 'b', 'c'
  UNION ALL SELECT 3, 'c', 'a'
  UNION ALL SELECT 4, 'x', 'w'
  UNION ALL SELECT 5, 'y', 'x'
  UNION ALL SELECT 6, 'z', 'y'
)
SELECT
  *
FROM
  dictionary   dict
CROSS APPLY
(
  SELECT COUNT(*) FROM dictionary WHERE dictionary.id > dict.id AND dictionary.[key] = dict.[val]
)
  lookup(hits)
CROSS APPLY
(
  SELECT 1, 3 WHERE lookup.hits = 0
  UNION ALL
  SELECT 1, 2 WHERE lookup.hits > 0
  UNION ALL
  SELECT 2, 3 WHERE lookup.hits > 0
)
  map([from], [to])

-- [key]s 'c', 'x', 'y' and 'z' should only have one output rows

-- It's "acceptable" for only 'z' to have just one output row IFF a self join can be avoided

Other options that I can think of are all variants on the self-join...

  dictionary   dict
LEFT JOIN
(
  SELECT key, MAX(id) AS id FROM dictionary GROUP BY key
)
  lookup
    ON  lookup.key = dict.value
    AND lookup.id  > dict.id

Or...

  dictionary   dict
OUTER APPLY
(
  SELECT 1 WHERE EXISTS (SELECT * FROM dictionary WHERE dictionary.id > dict.id AND dictionary.key = dict.value)
)
  lookup(hits)

I am, however, trying to avoid the self join of a CTE, possibly with windowed functions the I haven't thought of? Anything just to avoid the CTE being computed twice...

(Ignoring the lookup.id > dict.id aspect is fine, if it means avoiding the self join...)

EDIT: More complete example added, and a SQL Fiddle too, thanks to @MartinSmith pointing out some inconsistencies...

http://sqlfiddle.com/#!6/9eecb7db59d16c80417c72d1e1f4fbf1/17407

MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • @MartinSmith - Which would make the `LEFT JOIN` the least worst option at present. – MatBailie Jan 01 '18 at 18:51
  • How often *anything* is computed or accessed is up to the optimizer. Just because a CTE appears multiple times in a query doesn't mean that the CTE will be "computed twice". The entire query, including any CTEs, is optimized as a whole and the system *tries* to minimize certain costs (such as I/O). To know where it's being inefficient, you need to inspect execution plans, not just assume that a certain language feature is a source of poor performance. – Damien_The_Unbeliever Jan 02 '18 at 09:55
  • @Damien_The_Unbeliever - Although I agree with the sentiment, in this case I believe that it's almost a semantic argument. It is a fact that with the code above the execution plan ***can*** *(as opposed to **will**)* compute the CTE multiple times, where as in MartinSmith's answer it ***won't*** compute the CTE multiple times. – MatBailie Jan 02 '18 at 11:04

1 Answers1

2

Here's one way you could use windowed functions.

First unpivot the rows out so that keys and values become generic terms then using MAX ... OVER (PARTITION BY term) to find the id of the highest row where that term is used as a key.

In this example it then sets a flag and discards the duplicate row added by the unpivoting (preserving the context = 'v' row from the pair as this is the one with the information required for the flag).

You can then use that to join onto a table value constructor containing your map values.

WITH dictionary(id, [key], value)
     AS (
            SELECT 1, 'a', 'b' 
  UNION ALL SELECT 2, 'b', 'c'
  UNION ALL SELECT 3, 'c', 'a'
  UNION ALL SELECT 4, 'x', 'w'
  UNION ALL SELECT 5, 'y', 'x'
  UNION ALL SELECT 6, 'z', 'y'   
     ),
     t1
     AS (SELECT dict.*,
                context,
                highest_id_where_term_is_key = MAX(CASE
                                                     WHEN context = 'k'
                                                       THEN v.id
                                                   END) OVER (PARTITION BY term)
         FROM   dictionary dict
                CROSS APPLY (VALUES(id, [key], 'k'),
                                   (id, value, 'v')) v(id, term, context)),
     t2
     AS (SELECT *,
                val_in_later_key = CASE
                                     WHEN id < highest_id_where_term_is_key
                                       THEN 1
                                     ELSE 0
                                   END
         FROM   t1
         WHERE  context = 'v' 
         -- Discard duplicate row from the unpivot - only want the "value" row
        )
SELECT id,
       [key],
       value,
       highest_id_where_term_is_key,
       map.[from],
       map.[to]
FROM   t2
       JOIN (VALUES (1, 3, 0),
                    (1, 2, 1),
                    (2, 3, 1) ) map([from], [to], [flg])
         ON map.flg = t2.val_in_later_key
ORDER  BY id 

Returns

+----+-----+-------+------------------------------+------+----+
| id | key | value | highest_id_where_term_is_key | from | to |
+----+-----+-------+------------------------------+------+----+
|  1 | a   | b     | 2                            |    1 |  2 |
|  1 | a   | b     | 2                            |    2 |  3 |
|  2 | b   | c     | 3                            |    1 |  2 |
|  2 | b   | c     | 3                            |    2 |  3 |
|  3 | c   | a     | 1                            |    1 |  3 |
|  4 | x   | w     | NULL                         |    1 |  3 |
|  5 | y   | x     | 4                            |    1 |  3 |
|  6 | z   | y     | 5                            |    1 |  3 |
+----+-----+-------+------------------------------+------+----+
Martin Smith
  • 438,706
  • 87
  • 741
  • 845