-1

Running SQL Server 14.0.2037.2 on Windows 10 Enterprise version 21H1

I am trying to return IDs only when another columns entry across all rows within that ID matches the entries of a similar column in a second table, again within specific IDs. For example,

create table #F(patentid VARCHAR(8), subclass VARCHAR(3));

insert into #F values ('l','x');
insert into #F values ('l','y');
insert into #F values ('l','z');

insert into #F values ('m','x');

insert into #F values ('n','z');


create table #P(patentid VARCHAR(8), subclass VARCHAR(3));


insert into #P values ('b','x');

insert into #P values ('c','w');
insert into #P values ('c','x');
insert into #P values ('c','y');
insert into #P values ('c','z');

insert into #P values ('d','x');
insert into #P values ('d','y');
insert into #P values ('d','z');

insert into #P values ('e','x');
insert into #P values ('e','y');
insert into #P values ('e','z');

I am trying to write a query that will return, for each patentid in #F, all patentIDs in #P where the subclass entries match exactly. This true for "l" in #F that matches with "d" and "e" in #P [the matching subclasses are 'x', 'y' and 'z'] as well as for "m" in #F that matches with "b" in #P [the matching subclass is "x"]

the output should be:

l, d
l, e
m, b

The code needs to be efficient as both #F and #P contain millions of rows. I have tried Union, but that works with entire sets of rows and doesn't allow matching by a subset (i.e. within a patentid)

Any help much appreciated.

Charlieface
  • 52,284
  • 6
  • 19
  • 43
user2272413
  • 169
  • 1
  • 1
  • 5
  • This might help: -- What you want is a join where the number of items in the source matches the number of items in the joined table and matches the number of items in the join – Hogan Oct 06 '21 at 19:54

2 Answers2

1

This is an example of Relational Division Without Remainder, with multiple divisors.

In other words you want to divide (in a relational sense) #F by #P and only take results where there is no remainder.

There are many solutions, here is one

  • Partition both tables by patentid and calculate a windowed count
  • Left join one table to the other by subclass, filtering against exact matching counts
  • Group by pairs of IDs
  • Filter out anything which doesn't have every f.subclass with a matching p.subclass, we do this with HAVING COUNT(*) = COUNT(p.subclass) because COUNT(p.subclass) only counts non-nulls.
SELECT
  f.patentid, p.patentid
FROM (
    SELECT *,
      cnt = COUNT(*) OVER (PARTITION BY f.patentid)
    FROM #F f
) f
LEFT JOIN (
    SELECT *,
      cnt = COUNT(*) OVER (PARTITION BY p.patentid)
    FROM #P p
) p ON p.cnt = f.cnt AND p.subclass = f.subclass
GROUP BY
  f.patentid, p.patentid
HAVING COUNT(*) = COUNT(p.subclass);

db<>fiddle

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • This is great and worked perfectly. Thank you so much @Charlieface ! I learnt something today.. (partitioning selects sets of rows..) – user2272413 Oct 07 '21 at 17:48
0

Is this efficient enough for you?

select f.patentId, p.patentid 
from (
    select patentId, STRING_AGG(subclass,'') WITHIN GROUP (ORDER BY subclass ASC) as class 
    from #F 
    group by patentId
) f
inner join (
    select patentId, STRING_AGG(subclass,'') WITHIN GROUP (ORDER BY subclass ASC) as class 
    from #P 
    group by patentId
) p on f.class = p.class
Dale K
  • 25,246
  • 15
  • 42
  • 71
bogdan ioan
  • 378
  • 3
  • 7
  • A good answer explains how it works, in addition to providing working code. – Dale K Oct 06 '21 at 20:29
  • this does not work -- this will give multiple results for m since x is in many groups -- you also have to check the counts (as I said in my comment) – Hogan Oct 06 '21 at 21:09