2

I'm trying to link 2 columns (on 2 separate tables) together so that if every word in one column is contained in the other then they will match.

For example, the following values should match:

Paul Smith|Paul Andrew Smith
Paul Smith|Paul Andrew William Smith
Paul William Smith|Paul Andrew William Smith
Paul Andrew Smith|Paul Smith

but the following should NOT be matched:

Paul William Smith|Paul Andrew Smith

I'm using SQL Server 2016.

I'd like to do this with a SELECT query. I have a vague idea of using the string_split function (on spaces), cross applying the 2 tables then using the MAX function but this would create several millions of rows if I'm dealing with just a few thousand names so it wouldn't be very efficient.

Sample data:

DROP TABLE IF EXISTS #TEMP1
DROP TABLE IF EXISTS #TEMP2

CREATE TABLE #TEMP1 (NAME NVARCHAR(300))
CREATE TABLE #TEMP2 (NAME NVARCHAR(300))

INSERT #TEMP1 SELECT 'Paul Smith'
INSERT #TEMP1 SELECT 'Amy Nicholas Stanton'
INSERT #TEMP1 SELECT 'Andrew James Thomas'

INSERT #TEMP2 SELECT 'Paul Andrew Smith'
INSERT #TEMP2 SELECT 'Amy Stanton'
INSERT #TEMP2 SELECT 'Andrew Marcus Thomas'

So from the sample data, the first 2 rows should match and 3 rows shouldn't match.

EDIT: I've put my vague idea into practice, the following solution works but as I expected it's really slow when you're dealing with tables that contain thousands of rows.

SELECT DISTINCT A.[FIRSTNAME],A.[SECONDNAME]
FROM (
    SELECT *
          ,MIN([FIRSTMATCH]) OVER(PARTITION BY [SRN],[FIRSTNAME]) [FM]
          ,MIN([SECONDMATCH]) OVER(PARTITION BY [FRN],[SECONDNAME]) [SM]
    FROM (
            SELECT  DISTINCT A.NAME [FIRSTNAME]
                            ,B.NAME [SECONDNAME]
                            ,A.value [FIRSTVAL]
                            ,MAX(IIF(A.VALUE=B.VALUE,1,0)) OVER(PARTITION BY A.VALUE,B.RN) [FIRSTMATCH]
                            ,B.value [SECONDVAL]
                            ,MAX(IIF(B.VALUE=A.VALUE,1,0)) OVER(PARTITION BY B.VALUE,A.RN)  [SECONDMATCH]
                            ,A.RN [FRN]
                            ,B.RN [SRN]
            FROM (
                    SELECT DISTINCT NAME, DENSE_RANK() OVER(ORDER BY NAME) [RN],value
                    FROM #TEMP1
                    CROSS APPLY STRING_SPLIT(LTRIM(RTRIM(NAME)),' ')
                    WHERE LTRIM(RTRIM(NAME)) !=''
            )A
            CROSS APPLY(
                    SELECT DISTINCT NAME, DENSE_RANK() OVER(ORDER BY NAME) [RN],value
                    FROM #TEMP2
                    CROSS APPLY STRING_SPLIT(LTRIM(RTRIM(NAME)),' ')
                    WHERE LTRIM(RTRIM(NAME)) !=''
            )B 
    )A
)A
WHERE A.SM = 1 OR A.FM = 1
Anonymous
  • 440
  • 3
  • 14
  • Request you to include the DDL, sample data and expected output then the logic will be clear and easy to answer for anyone. – Suraj Kumar Jan 12 '20 at 08:08
  • doesn't my example count as sample data? – Anonymous Jan 12 '20 at 08:18
  • It will be better if you can bifurcate/separate your current data in the form of two tables and expected output. – Suraj Kumar Jan 12 '20 at 09:11
  • I've added sample data – Anonymous Jan 12 '20 at 09:46
  • It reminds me of a similar problem I had and I used Full-Text Indexing for it. I know it's frowned upon, but I just wanted to throw the idea out there, if you're willing to try it. Here are a couple of questions I posted in the past about this, and solutions I got [1](https://stackoverflow.com/questions/40378070/fulltext-search-multiple-columns-for-multiple-search-terms) [2](https://stackoverflow.com/questions/40484312/fulltext-show-the-matching-word-in-a-multi-word-search-condition-with-or) [3](https://stackoverflow.com/questions/40406522/fulltext-index-search-has-large-number-of-page-reads) – Radu Gheorghiu Jan 12 '20 at 12:37

2 Answers2

1

You can split the string and aggregate. Assuming that none of the names have duplicate parts:

with n1 as (
      select temp1.name, value as part, count(value) over (partition by name) as num_parts
      from temp1 cross apply
           string_split(temp1.name, ' ')
     ),
     n2 as (
      select temp2.name, value as part, count(value) over (partition by name) as num_parts
      from temp2 cross apply
           string_split(temp2.name, ' ') 
     )
select n1.name, n2.name
from n1 join
     n2
     on n1.part = n2.part and n1.num_parts <= n2.num_parts
group by n1.name, n2.name, n1.num_parts
having count(*) = n1.num_parts;

Here is a db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks, this is a good start but unfortunately a little restrictive. This only works for lines where the names in temp 1 have less terms than the names in temp 2 (so e.g. if you swap the values of Paul Smith with Paul Andrew Smith it won't work). Could of course run it twice with the tables swapped in the join conditions and grouping but I wonder if there's a way to do this in one query. – Anonymous Jan 13 '20 at 06:53
0

Building on from Gordon Linoff's answer this seems to work:

;WITH N1 AS (
      SELECT *,COUNT(*) OVER(PARTITION BY NAME) [NUM_PARTS]
      FROM (
            SELECT DISTINCT NAME, VALUE [PART]
            FROM #TEMP1 CROSS APPLY
                 STRING_SPLIT(#TEMP1.NAME, ' ')
           )A
     ),
     N2 AS (
      SELECT *,COUNT(*) OVER(PARTITION BY NAME) [NUM_PARTS]
      FROM (
            SELECT DISTINCT NAME, VALUE [PART]
            FROM #TEMP2 CROSS APPLY
           STRING_SPLIT(#TEMP2.NAME, ' ')
           )A 
     )
SELECT N1.NAME, N2.NAME
FROM N1 JOIN N2 ON N1.PART = N2.PART
group by n1.name, n2.name, n1.num_parts,n2.num_parts
having count(n2.part) = n1.num_parts
or     count(n1.part) = n2.num_parts
Anonymous
  • 440
  • 3
  • 14