1

TLDR

How to match a set of sets against a single set and bind that to the corresponding rows?

Given a row that has a linked summary table with key/value pairs describing properties of that row, and a bunch of search-descriptions (target) that describes how to sum up stuff from that row, how can I find which search-descriptions match a given row, based on matching the properties-table against the key/value pairs in the search-description?

Simplified example:

CREATE TABLE TargetKeyValue(TargetId INT, TargetKey NVARCHAR(50), TargetValue NVARCHAR(50))
CREATE TABLE OriginalRows(Id INT, Cost DECIMAL, BunchOfOtherCols NVARCHAR(500),
    CONSTRAINT [PK_Id] PRIMARY KEY CLUSTERED ([Id] ASC))
CREATE TABLE RowKeyValue(RowId INT, KeyPart NVARCHAR(50), ValuePart NVARCHAR(50),
     CONSTRAINT [FK_RowId_Id] FOREIGN KEY (RowId) REFERENCES OriginalRows(Id))

INSERT INTO OriginalRows VALUES
    (1, 55.5, 'Some cool red coat'),
    (2, 80.0, 'Some cool green coat XL'),
    (3, 250.00, 'Some cool green coat L'),
    (4, 100.0, 'Some whiskey'),
    (5, 42.0, 'This is not a match')

INSERT INTO RowKeyValue VALUES
    (1, 'Color', 'Red'),
    (1, 'Size', 'XL'),
    (1, 'Kind', 'Coat'),
    (2, 'Color', 'Green'),
    (2, 'Size', 'XL'),
    (2, 'Kind', 'Coat'),
    (3, 'Color', 'Green'),
    (3, 'Size', 'L'),
    (3, 'Kind', 'Coat'),
    (4, 'Color', 'Green'),
    (4, 'Size', 'Medium'),
    (4, 'Kind', 'Whiskey')


INSERT INTO TargetKeyValue VALUES
    (55, 'Color', 'Red'),
    (56, 'Color', 'Green'),
    (56, 'Size', 'XL'),
    (57, 'Kind', 'Coat'),
    (58, 'Color', 'Green'),
    (58, 'Size', 'Medium'),
    (58, 'Kind', 'Whiskey')

This gives the following tables:


-- table OriginalRows
Id  Cost    BunchOfOtherCols
1   56      Some cool red coat
2   80      Some cool green coat XL
3   250     Some cool green coat L
4   100     Some whiskey
5   42      This is not a match

-- table RowKeyValue
RowId   KeyPart ValuePart
1       Color   Red
1       Size    XL
1       Kind    Coat
2       Color   Green
2       Size    XL
2       Kind    Coat
3       Color   Green
3       Size    L
3       Kind    Coat
4       Color   Green
4       Size    Medium
4       Kind    Whiskey

-- table TargetKeyValue
TargetId    TargetKey   TargetValue
55          Color       Red
56          Color       Green
56          Size        XL
57          Kind        Coat
58          Color       Green
58          Size        Medium
58          Kind        Whiskey

Expected outcome

The function below will give this correct outcome:

Id  Cost    BunchOfOtherCols            IsTargetMatch   TargetKeyId
1   56      Some cool red coat          1               55
2   80      Some cool green coat XL     1               56
3   250     Some cool green coat L      1               57
4   100     Some whiskey                1               58
5   42      This is not a match         0               NULL

In other words:

  • Bind the original row id to the target-id that it first matches (I'm ok with a join returning multiple times if that's easier)
  • Show the original row whenever it does not match
  • A match is true if a group belonging to one target-ids matches the same value of a given original row

Current approach with cursors... alas

The code below uses cursors, but this proves slow (understandably so since it's basically just a non-indexed table scan over and over).

Another approach I tried was using XML PATH queries, but that turned out to be a non-starter (it was easy, but too slow as well).

I'm aware this is a non-trivial task in relation databases, but I hope there's still a rather straightforward solution. What I have below kinda works, and I might just use a batch process to store the results or something, unless there's a better way to do this using SET operations or, idunno, FULL JOIN?

Any solution that can be used in a view (i.e., not involving dynamic SQL or calling SP's) is fine. We used to have an SP-based solution but since data needs to be analyzed in PowerBI and other systems, SQL Views and determinism are the way to go.

Here's a fully working minimal example of what I'm after. The function is the part that I'm looking at to replace with a less procedural and more functional, i.e. set-based approach:

CREATE TABLE TargetKeyValue(TargetId INT, TargetKey NVARCHAR(50), TargetValue NVARCHAR(50))
CREATE TABLE OriginalRows(Id INT, Cost DECIMAL, BunchOfOtherCols NVARCHAR(500),
    CONSTRAINT [PK_Id] PRIMARY KEY CLUSTERED ([Id] ASC))
CREATE TABLE RowKeyValue(RowId INT, KeyPart NVARCHAR(50), ValuePart NVARCHAR(50),
     CONSTRAINT [FK_RowId_Id] FOREIGN KEY (RowId) REFERENCES OriginalRows(Id))

INSERT INTO OriginalRows VALUES
    (1, 55.5, 'Some cool red coat'),
    (2, 80.0, 'Some cool green coat XL'),
    (3, 250.00, 'Some cool green coat L'),
    (4, 100.0, 'Some whiskey'),
    (5, 42.0, 'This is not a match')

INSERT INTO RowKeyValue VALUES
    (1, 'Color', 'Red'),
    (1, 'Size', 'XL'),
    (1, 'Kind', 'Coat'),
    (2, 'Color', 'Green'),
    (2, 'Size', 'XL'),
    (2, 'Kind', 'Coat'),
    (3, 'Color', 'Green'),
    (3, 'Size', 'L'),
    (3, 'Kind', 'Coat'),
    (4, 'Color', 'Green'),
    (4, 'Size', 'Medium'),
    (4, 'Kind', 'Whiskey')


INSERT INTO TargetKeyValue VALUES
    (55, 'Color', 'Red'),
    (56, 'Color', 'Green'),
    (56, 'Size', 'XL'),
    (57, 'Kind', 'Coat'),
    (58, 'Color', 'Green'),
    (58, 'Size', 'Medium'),
    (58, 'Kind', 'Whiskey')

GO


CREATE FUNCTION dbo.MatchTargetAgainstKeysFromRow
(
    @rowid INT
)
RETURNS @MatchResults TABLE(
    IsTargetMatch BIT,
    TargetKeyId INT)

AS
BEGIN

    --
    -- METHOD (1) (faster, by materializing the xml field into a cross-over lookup table)
    --

    -- single row from activities as key/value pairs multi-row
    DECLARE @rowAsKeyValue AS TABLE(KeyPart NVARCHAR(1000), ValuePart NVARCHAR(MAX))
    INSERT INTO @rowAsKeyValue (KeyPart, ValuePart)
        SELECT KeyPart, ValuePart FROM RowKeyValue WHERE RowId = @rowid


    DECLARE @LookupColumn NVARCHAR(100)
    DECLARE @LookupValue NVARCHAR(max)
    DECLARE @TargetId INT
    DECLARE @CurrentTargetId INT
    DECLARE @IsMatch INT
    DECLARE key_Cursor CURSOR
        LOCAL STATIC FORWARD_ONLY READ_ONLY
        FOR SELECT TargetKey, TargetValue, TargetId FROM TargetKeyValue  ORDER BY TargetId

    OPEN key_Cursor
    FETCH NEXT FROM key_Cursor INTO @LookupColumn, @LookupValue, @TargetId

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @IsMatch = (SELECT COUNT(*) FROM @rowAsKeyValue WHERE KeyPart = @LookupColumn AND ValuePart = @LookupValue)
        IF(@IsMatch = 0)
        BEGIN
            -- move to next key that isn't the current key
            SET @CurrentTargetId = @TargetId
            WHILE @@FETCH_STATUS = 0 AND @CurrentTargetId = @TargetId
            BEGIN
                FETCH NEXT FROM key_Cursor INTO @LookupColumn, @LookupValue, @TargetId
            END
        END
        ELSE
            BEGIN
                SET @CurrentTargetId = @TargetId
                WHILE @@FETCH_STATUS = 0 AND @IsMatch > 0 AND @CurrentTargetId = @TargetId
                BEGIN
                    FETCH NEXT FROM key_Cursor INTO @LookupColumn, @LookupValue, @TargetId
                    IF @CurrentTargetId = @TargetId
                        SET @IsMatch = (SELECT COUNT(*) FROM @rowAsKeyValue WHERE KeyPart = @LookupColumn AND ValuePart = @LookupValue)
                END
                IF @IsMatch > 0
                BEGIN
                    -- we found a positive matching key, nothing more to do
                    BREAK
                END
            END
    END

    DEALLOCATE key_Cursor       -- deallocating a cursor also closes it

    INSERT @MatchResults
    SELECT
        (CASE WHEN (SELECT COUNT(*) FROM @rowAsKeyValue) > 0 THEN 1 ELSE 0 END),
        (CASE WHEN @IsMatch > 0 THEN @CurrentTargetId ELSE NULL END)

    RETURN
END

GO

-- function in action
select * from OriginalRows
cross apply dbo.MatchTargetAgainstKeysFromRow(Id) fn

-- cleanup
drop function dbo.MatchTargetAgainstKeysFromRow
drop table TargetKeyValue
drop table RowKeyValue
drop table OriginalRows
Abel
  • 56,041
  • 24
  • 146
  • 247
  • 2
    EAV designs *never* work well in RDBMS; if you can fix your design I would suggest you do. – Thom A Apr 29 '21 at 10:58
  • 1
    In the SQL you give us I notice that the parameters are a `uniqueidentifier`, where is that coming from? None of your columns are a guid. A full [mre], with DDL and DML statements, would likely help us help you here. – Thom A Apr 29 '21 at 11:08
  • On EAV designs, yeah, I'm quite aware of that, and I wished it could be redesigned differently, or maybe we shouldn't use an RDBMS to begin with. The problem is that users are free to select the columns they want to use in the target query, and while this is trivial to solve with dynamic sql, we couldn't go that way (must be views exported to PowerBI). So we tried a different approach (splitting the query-eligible columns into key/value pairs table). – Abel Apr 29 '21 at 11:48
  • This is eminently doable in a set-based query, but just want to clarify: we are trying to find for every row in `OriginalRows`, which `TargetId`s may match, and we do this by taking each `TargetId` and its `Key`/`Value` pairs and ensuring that they match the `Key`/`Value` pairs of the particular `OriginalRows`? We don't care if there are *other* key/values for that `OriginalRows`, but we *do care* that all `TargetKeyValue`s will match for a given `TargetId`? And if multiple `TargetId`s match then we return all of them? – Charlieface May 01 '21 at 23:50

1 Answers1

1

This question is a case of Relational Division With Remainder, with multiple dividends and divisors.

Relational division is basically the opposite of a join: in this case, we want to know which OriginalRows match which TargetIds, based on every key/value pair for the TargetId matching the key/value pairs of the OriginalRows.

There are many ways to do this, here are a few:

SELECT
    r.Id,
    r.Cost,
    r.BunchOfOtherCols,
    t.TargetId
FROM OriginalRows r
OUTER APPLY (
    SELECT ttKV.TargetId
    FROM TargetKeyValue tKV
    LEFT JOIN RowKeyValue rKV
        ON rKV.KeyPart = tKV.TargetKey AND rKV.ValuePart = tKV.TargetValue
        AND rKV.RowId = r.Id
    GROUP BY tKV.TargetId
    HAVING COUNT(*) = COUNT(rKV.RowId)  -- all target k/vs have match
) t;
SELECT
    r.Id,
    r.Cost,
    r.BunchOfOtherCols,
    tKV.TargetId
FROM OriginalRows r
CROSS JOIN TargetKeyValue tKV
LEFT JOIN RowKeyValue rKV
    ON rKV.KeyPart = tKV.TargetKey AND rKV.ValuePart = tKV.TargetValue
    AND rKV.RowId = r.Id
GROUP BY
    r.Id,
    r.Cost,
    r.BunchOfOtherCols,
    tKV.TargetId
HAVING COUNT(*) = COUNT(rKV.RowId)  -- all target k/vs have match
SELECT
    r.Id,
    r.Cost,
    r.BunchOfOtherCols,
    tKV.TargetId
FROM OriginalRows r
CROSS JOIN TargetKeyValue tKV
CROSS APPLY (VALUES (CASE WHEN EXISTS (SELECT 1
        FROM RowKeyValue rKV
            WHERE rKV.KeyPart = tKV.TargetKey AND rKV.ValuePart = tKV.TargetValue
            AND rKV.RowId = r.Id
        ) THEN 1 END
) ) rKV(IsMatch)
GROUP BY
    r.Id,
    r.Cost,
    r.BunchOfOtherCols,
    tKV.TargetId
HAVING COUNT(*) = COUNT(rKV.IsMatch)  -- all target k/vs have match

Instead of the HAVING COUNT(*) = COUNT(rKV.RowId) you could also replace that with

HAVING COUNT(CASE WHEN rKV.RowId IS NULL THEN 1 END) = 0 -- all target k/vs have match

If you want a function for a single OriginalRows, that is somewhat simpler:

CREATE FUNCTION dbo.MatchTargetAgainstKeysFromRow
(
    @rowid INT
)
RETURNS TABLE
AS RETURN (
    SELECT
        tKV.TargetId
    FROM TargetKeyValue tKV
    LEFT JOIN RowKeyValue rKV
        ON rKV.KeyPart = tKV.TargetKey AND rKV.ValuePart = tKV.TargetValue
        AND rKV.RowId = @rowid
    GROUP BY
        tKV.TargetId
    HAVING COUNT(*) = COUNT(rKV.RowId)  -- all target k/vs have match
);

GO
Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • This is very interesting, tx! While there we some (small) mistakes in the queries (i.e., table `t` is not in scope inside the `outer apply`), they roughly gave the result I needed (albeit as written not the same as in my example, but this was easy to solve). Unfortunately, however, these all run much slower than the procedural approach I showed (30s vs 6s). After internal discussion we decided to precalc the matches on `INSERT` so we can run them from an actual table. – Abel May 03 '21 at 10:42
  • Btw, I meanwhile found another approach that works similarly well, which counts the `INTERSECT` result of `TargetKeyValue` vs `RowKeyValue`, which runs in 2min10s, vs 30s for your first solution above (but still, the cursor-approach is the overall winner by a large margin). – Abel May 03 '21 at 10:46
  • I suspect indexing may be the cause of that. You could share a query plan by uploading to https://brentozar.com/pastetheplan please also add table and index definitions. Yes there are quite a few ways to do this, and dependent on your exact tables it may be faster or easier one way than the other. – Charlieface May 03 '21 at 11:30
  • that's a great way of sharing queryplans! I doubt it'll help much in this particular case, though. Taking your last function (modified to fit) and ran against the main table with 10k real-world rows, gives this https://www.brentozar.com/pastetheplan/?id=BymDoO6P_. As you can see, the expensive seeks are already indexed seeks, unless I miss something? – Abel May 03 '21 at 13:34
  • This is weird: your syntax at the end, `RETURNS TABLE AS RETURN(SELECT...` is an order of magnitude faster (30s vs 3min) than `RETURNS @T TABLE(TargetID INT) AS BEGIN ... INSERT @T SELECT ... RETURN END`. Arguably, these should perform the same, no? – Abel May 03 '21 at 14:00
  • 1
    No absolutely not. Your is a multi-statement TVF and mine is inline (it gets fully inlined into the outer query) see https://stackoverflow.com/questions/2554333/multi-statement-table-valued-function-vs-inline-table-valued-function and https://www.brentozar.com/archive/2017/05/sql-server-2017-interleaved-mstvfs-vs-inline-table-valued-functions/ – Charlieface May 03 '21 at 14:17
  • @Abel also looking at your query plan, you have a bunch of scans and hash joins, and a huge nested loop. It doesn't look particularly efficient, but couldn't advise without seeing the full definition of the view, tables and indexes, and ideally an *actual* as opposed to *estimated* plan. I suggest you make a new question if you want performance advice. – Charlieface May 04 '21 at 10:07