-1

I have two data sets from two different systems being merged together within SQL, however, there is a slight difference within the naming conventions on the two systems. The change in convention is not consistent across the larger data sample but normally requires one modification to match.

System 1 data System 2 data
AA0330 AA_330
AA0340 AA_340
AA0331 AA_331
AA0341 AA-341

I have been using the below Levenshtein distance SQL function too fuzzy match and get the below result, however, end up with duplicate joins. How could I modify my code to mitigate this?

SELECT [System1].[mvmt1],
       [System2].[mvmt2]       
FROM [System1]
left join [System2] 
     ON dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) < 2;

http://www.artfulsoftware.com/infotree/qrytip.php?id=552&m=0

Current output:

System 1 data System 2 data
AA0330 AA_330
AA0330 AA_340
AA0340 AA_340
AA0331 AA_331
AA0341 AA-341

How can I make sure I only get one outcome from the join?

tg00222
  • 3
  • 3
  • then look at the function , include function in the select to see what it returns – eshirvana Mar 31 '21 at 17:53
  • The function just returns the levenshtein relationship as an int. I am then using this within my left join. How would I make the left join only return the first match similar to a vlookup? – tg00222 Mar 31 '21 at 18:03
  • even if with left join you are matching table a and b based on the result of the func , so Levenshtein Distance between AA0330 from table a is less that 2 from (AA_330, AA_340) in table b and that's why you have 2 results there – eshirvana Mar 31 '21 at 18:06
  • Thanks, is there anyway that I could clean the table within the same query to only give back one result? Almost a group by system 1?? – tg00222 Mar 31 '21 at 18:09
  • Levenshtein is not really the right tool for the job, as you've discovered. I also cannot imagine it is very performant. A better bet would be to use, say (based on your sample), the first 2 and last 3 characters. Or you can use nested replaces until one side fits the format of the other. – HoneyBadger Mar 31 '21 at 18:11
  • Why don't you just strip out hyphens and underscores? – shawnt00 Mar 31 '21 at 18:36
  • @shawnt00 It could be an additional character such as 0 or 1. The error is not consistent. For example 7A0330 to 7A330 – tg00222 Apr 01 '21 at 08:59

2 Answers2

0

not the best solution , but you can compare first 2 character and last 3 character , if all the codes are following the same pattern (2 character on right and 3 digit at the end ):

SELECT [System1].[mvmt1],
       [System2].[mvmt2]       
FROM [System1]
inner join [System2] 
     ON left(mvmt1,2) = left(mvmt2,2) 
     and right(mvmt1,3) = right(mvmt2,3)
eshirvana
  • 23,227
  • 3
  • 22
  • 38
-1

what about something like this (sorry about the poor formatting):

WITH Initial_Fuzzy_Join as(
SELECT [System1].[mvmt1],
       [System2].[mvmt2]       ,
dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) as StringDistanceMetric
FROM [System1]
left outer join [System2] 
     ON dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) < 2
)
SELECT mvmt1, mvmt2, max(StringDistanceMetric)
FROM Initial_Fuzzy_Join
Group by mvmt1,mvmt2