0

I am trying to find the Levenshtien distance between the columns of two different tables TableA and TableB. Basically I need to match ColumnA of TableA with all the elements of ColumnB in TableB and find the Levenshtien Distance

I have created a Levenshtien Function as follows

CREATE FUNCTION [Levenshtein]
     (@value1 [NVARCHAR](MAX), 
      @value2 [NVARCHAR](MAX))
RETURNS [INT]
AS 
EXTERNAL NAME [FastenshteinAssembly].[Fastenshtein.Levenshtein].[Distance]
GO

This is basically calling a Levenshtien dll I have on my machine. I tried creating a stored procedure for this operation but I am unsure whether that's an optimized approach or not.

Table B contains millions of CompanyNames and TableA also consists of thousand of CompanyNames so this operation would be essentially a (n*m) operation.

Whats the optimized approach of achieving this.

Thanks

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459

1 Answers1

2

There is no optimized approach for doing this.

There may be some hacks that you can do to simplify the processing. For instance, you could create lookup tables on each side using n-grams and only compare names whose ngrams are close. Or, you could use soundex() for the same purpose -- or the first three characters.

However, if you need to match to all possibilities, then this is an expensive n*m operation in SQL Server.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • 1
    Thanks Gordon, I had tried doing it C# initially, I would take a batch from TableA and a batch from TableB and then use LINQ projections and make a cross-product of these values and call the Levenshtien function parallely but after running for 10-20 minutes I get a thread abort exception. I thought doing in SQL would make it a way better. –  Mar 02 '19 at 15:06
  • 1
    If I introduce a threshold to only get those values from TableB which have a edit distance of 5. Can it make it faster –  Mar 02 '19 at 15:11
  • @user5593950 . . . Not at all. All the values still have to be calculated. You need to do some preliminary work to reduce the number of matches -- ngrams comes to mind. For shortish strings, a distance of 5 is pretty big. For instance, "Denver" and "Dallas" are a distance of 5. "France" and "Fresno" are a distance of 4. "RUSSIA" and "USSR" are a distance of 3. – Gordon Linoff Mar 02 '19 at 16:26