I'm trying to link 2 columns (on 2 separate tables) together so that if every word in one column is contained in the other then they will match.
For example, the following values should match:
Paul Smith|Paul Andrew Smith
Paul Smith|Paul Andrew William Smith
Paul William Smith|Paul Andrew William Smith
Paul Andrew Smith|Paul Smith
but the following should NOT be matched:
Paul William Smith|Paul Andrew Smith
I'm using SQL Server 2016.
I'd like to do this with a SELECT
query. I have a vague idea of using the string_split
function (on spaces), cross applying the 2 tables then using the MAX
function but this would create several millions of rows if I'm dealing with just a few thousand names so it wouldn't be very efficient.
Sample data:
DROP TABLE IF EXISTS #TEMP1
DROP TABLE IF EXISTS #TEMP2
CREATE TABLE #TEMP1 (NAME NVARCHAR(300))
CREATE TABLE #TEMP2 (NAME NVARCHAR(300))
INSERT #TEMP1 SELECT 'Paul Smith'
INSERT #TEMP1 SELECT 'Amy Nicholas Stanton'
INSERT #TEMP1 SELECT 'Andrew James Thomas'
INSERT #TEMP2 SELECT 'Paul Andrew Smith'
INSERT #TEMP2 SELECT 'Amy Stanton'
INSERT #TEMP2 SELECT 'Andrew Marcus Thomas'
So from the sample data, the first 2 rows should match and 3 rows shouldn't match.
EDIT: I've put my vague idea into practice, the following solution works but as I expected it's really slow when you're dealing with tables that contain thousands of rows.
SELECT DISTINCT A.[FIRSTNAME],A.[SECONDNAME]
FROM (
SELECT *
,MIN([FIRSTMATCH]) OVER(PARTITION BY [SRN],[FIRSTNAME]) [FM]
,MIN([SECONDMATCH]) OVER(PARTITION BY [FRN],[SECONDNAME]) [SM]
FROM (
SELECT DISTINCT A.NAME [FIRSTNAME]
,B.NAME [SECONDNAME]
,A.value [FIRSTVAL]
,MAX(IIF(A.VALUE=B.VALUE,1,0)) OVER(PARTITION BY A.VALUE,B.RN) [FIRSTMATCH]
,B.value [SECONDVAL]
,MAX(IIF(B.VALUE=A.VALUE,1,0)) OVER(PARTITION BY B.VALUE,A.RN) [SECONDMATCH]
,A.RN [FRN]
,B.RN [SRN]
FROM (
SELECT DISTINCT NAME, DENSE_RANK() OVER(ORDER BY NAME) [RN],value
FROM #TEMP1
CROSS APPLY STRING_SPLIT(LTRIM(RTRIM(NAME)),' ')
WHERE LTRIM(RTRIM(NAME)) !=''
)A
CROSS APPLY(
SELECT DISTINCT NAME, DENSE_RANK() OVER(ORDER BY NAME) [RN],value
FROM #TEMP2
CROSS APPLY STRING_SPLIT(LTRIM(RTRIM(NAME)),' ')
WHERE LTRIM(RTRIM(NAME)) !=''
)B
)A
)A
WHERE A.SM = 1 OR A.FM = 1