Levenshtein distance Python UDF as fuzzy matching proxy in SQL join

Question

I came across a forum post that describes a method of creating a Python UDF in Redshift: https://community.periscopedata.com/r/y715m2.

More info about Python UDFs in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html

I checked a number of outputs by the function (like select public.levenshtein('walk', 'cake'))- and it works quite well.

I am hoping to use this concept for fuzzy matching in joins between two tables on t1.first_name+last_name = t2.first_name+last_name.

Is anyone familiar with a "magical range" (or can suggest something from experience) in which a record should fall between to be deemed a likely match? ie. what should the min and max levenshtein (s,t) be to be considered a likely match.

score 1 · Accepted Answer · answered Feb 09 '18 at 17:03

1

It depends on your particular case. Think of it as a simple machine learning problem when you provide a training dataset - you can run the function against your data to see the values for different kinds of pairs and set your range based on that. If you're matching names the cost of error is quite high, both for false negative (no match for the same person) and false positive (match for different people) cases, so I would go with soundex rather than leuvenstein. AFAIK Leuvenstein distance would be equal to one for very different last names if they are different only in one letter but this can be two cases - when last names are actually the same but spelled differently, or when the last names are actually different but the difference is one letter. Soundex is better for distinguishing such cases.

answered Feb 09 '18 at 17:03

AlexYes

4,088
2
15
23

thank you; Would I need a UDF for soundex like in Levenstein's case? any information about implementing soundex in Redshift? – user8834780 Feb 09 '18 at 17:07
try this https://stackoverflow.com/questions/35403335/is-there-a-soundex-function-for-python – AlexYes Feb 09 '18 at 17:32
I am familiar with fuzzy for Python; any way to know the code that actually creates the package? because I would need that in order to create a user defined function in Redshift – user8834780 Feb 09 '18 at 17:34
try this https://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html#udf-importing-custom-python-library-modules – AlexYes Feb 09 '18 at 17:38
ie. download the package and upload as custom library? wow didnt know it was possible – user8834780 Feb 09 '18 at 17:41

Levenshtein distance Python UDF as fuzzy matching proxy in SQL join

1 Answers1