Finding the id's which include multiple criteria in long format

Question

Suppose I have a table like this,

id	tagId
1	1
1	2
1	5
2	1
2	5
3	2
3	4
3	5
3	8

I want to select id's where tagId includes both 2 and 5. For this fake data set, It should return 1 and 3.

I tried,

select id from [dbo].[mytable] where tagId IN(2,5)

But it takes 2 and 5 into account respectively. I also did not want to keep my table in wide format since tagId is dynamic. It can reach any number of columns. I also considered filtering with two different queries to find (somehow) the intersection. However since I may search more than two values inside the tagId in real life, it sounds inefficient to me.

I am sure that this is something faced before when tag searching. What do you suggest? Changing table format?

Mureinik · Accepted Answer · 2021-12-24T21:17:27.863

3

One option is to count the number of distinct tagIds (from the ones you're looking for) each id has:

SELECT   id
FROM     [dbo].[mytable] 
WHERE    tagId IN (2,5)
GROUP BY id
HAVING   COUNT(DISTINCT tagId) = 2

edited Dec 24 '21 at 21:17

answered Dec 24 '21 at 21:08

Mureinik

297,002
52
306
350

Clever solution. Little typo: `DISTINCT`. – Zakaria Dec 24 '21 at 21:14
The number 2 at the last step represents _at least_ condition for the number of the tags I am searching. Right? – maydin Dec 24 '21 at 21:16
@Zaki indeed a typo, thanks for noticing. Edited and fixed – Mureinik Dec 24 '21 at 21:18
1

@maydin The `where` clause limits the return rows to tagIds of 2 and 5, and then the `having` clause makes sure they are both there. If an ID has additional tags it won't interfere with this query – Mureinik Dec 24 '21 at 21:19

score 2 · Answer 2 · answered Dec 25 '21 at 19:26

This is actually a Relational Division With Remainder question.

First, you have to place your input into proper table format. I suggest you use a Table Valued Parameter if executing from client code. You can also use a temp table or table variable.

DECLARE @ids TABLE (tagId int PRIMARY KEY);
INSERT @ids VALUES (2), (5);

There are a number of different solutions to this type of question.

Classic double-negative EXISTS

SELECT DISTINCT
  mt.Id
FROM mytable mt
WHERE NOT EXISTS (SELECT 1
    FROM @ids i
    WHERE NOT EXISTS (SELECT 1
        FROM mytable mt2
        WHERE mt2.id = mt.id
          AND mt2.tagId = i.tagId)
);

This is not usually efficient though

Comparing to the total number of IDs to match

SELECT mt.id
FROM mytable mt
JOIN @ids i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM @ids);

This is much more efficient. You can also do this using a window function, it may be more or less efficient, YMMV.

SELECT mt.Id
FROM mytable mt
JOIN (
    SELECT *,
      total = COUNT(*) OVER ()
    FROM @ids i
) i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = MIN(i.total);

Another solution involves cross-joining everything and checking how many matches there are using conditional aggregation

SELECT mt.id
FROM (
    SELECT
      mt.id,
      mt.tagId,
      matches = SUM(CASE WHEN i.tagId = mt.tagId THEN 1 END),
      total = COUNT(*)
    FROM mytable mt
    CROSS JOIN @ids i
    GROUP BY
      mt.id,
      mt.tagId
) mt
GROUP BY mt.id
HAVING SUM(matches) = MIN(total)
   AND MIN(matches) >= 0;

db<>fiddle

There are other solutions also, see High Performance Relational Division in SQL Server

This is very informative. Thank you @Charlieface . I wish I have a chance to accept two answers together. But upvoted! — maydin, Dec 25 '21 at 22:56

Finding the id's which include multiple criteria in long format

2 Answers2