
I am doing record linkage in C# and want to compare pairs of fields using fuzzy-matching algorithms. I want to determine which algorithm is best suited to each field comparison.

The fields I am looking to compare are:

  • Last Name
  • First Name
  • Gender
  • Birth Year
  • Birth Month
  • Birth Day
  • SSN
  • Member Number
  • MRN
  • Street Number
  • Street Name
  • Street Type
  • Street Directional
  • City
  • State
  • Zip
  • Phone

The approximate string matching algorithms (ASMs) I currently use are:

  • Levenshtein Distance
  • Hamming Distance
  • Jaccard Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Longest Common Subsequence
  • Longest Common Substring
  • Overlap Coefficient
  • Ratcliff-Obershelp Similarity
  • Sorensen-Dice Distance
  • Tanimoto Coefficient
  • Damerau-Levenshtein Distance
  • Wagner-Fischer Distance
  • Soundex
  • Metaphone 3
  • NYSIIS

First, I compare two fields, such as FirstName1 and FirstName2, to see whether they match exactly.

For example, FirstName1 = "Bob" and FirstName2 = "Bob" are an exact match, so no fuzzy matching is needed.

On the other hand, FirstName1 = "Jill" and FirstName2 = "Bob" are not an exact match, so the pair moves on to a fuzzy comparison.
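The flow described above can be sketched roughly as follows. This is only an illustration: the `FieldComparer` class and its method names are hypothetical, Levenshtein stands in for whichever algorithm ends up being chosen, and the case-insensitive exact check and non-null inputs are assumptions rather than part of the original setup.

```csharp
using System;

public static class FieldComparer
{
    // Dynamic-programming Levenshtein distance, keeping only two rows.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // reuse the buffers for the next row
        }
        return prev[b.Length];
    }

    // Returns 1.0 on an exact match (assumed case-insensitive); otherwise
    // falls back to a Levenshtein-based similarity in [0, 1].
    public static double Compare(string x, string y)
    {
        if (string.Equals(x, y, StringComparison.OrdinalIgnoreCase)) return 1.0;

        string a = x.ToUpperInvariant(), b = y.ToUpperInvariant();
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / maxLen;
    }
}
```

With this sketch, `FieldComparer.Compare("Bob", "Bob")` short-circuits on the exact check, while `FieldComparer.Compare("Jill", "Bob")` falls through to the distance calculation.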

Which fuzzy-matching algorithms work better for particular fields, and which should be avoided for them?

armatita

1 Answer


I just wrote some similar code for entity resolution. The key, though, is that not all fields are created equal. For example, you should not use ASMs on SSN: a single differing digit means a different SSN and a different person.
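For an identifier like SSN, the comparison might look like the sketch below: strip formatting, then require exact equality. The `SsnComparer` class is a hypothetical name, and the nine-digit length check is an assumption about how the data is stored.

```csharp
using System;
using System.Linq;

public static class SsnComparer
{
    // Keep ASCII digits only, so "123-45-6789" and "123 45 6789" normalize alike.
    public static string Normalize(string ssn) =>
        new string((ssn ?? string.Empty).Where(c => c >= '0' && c <= '9').ToArray());

    // True only when both values normalize to the same nine digits.
    // No similarity score: a near miss is treated as a different SSN.
    public static bool IsSameSsn(string a, string b)
    {
        string x = Normalize(a), y = Normalize(b);
        return x.Length == 9 && x == y;
    }
}
```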

Instead of fuzzy matching address components, I would try to resolve the addresses first and then do an exact match. For example, a good address resolution service will treat:

Second Street NW and NW 2nd St

as the same street even though they have very poor similarity by all those metrics.
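To make the idea concrete, here is a toy canonicalizer that handles only the example above. It is not a substitute for a real address resolution service: the `StreetCanonicalizer` class and its tiny lookup tables are hypothetical, and it will misclassify streets whose name is itself a directional or type word (for example "North Street").

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreetCanonicalizer
{
    // Tiny illustrative tables; a real service uses far larger ones.
    static readonly Dictionary<string, string> Directionals = new Dictionary<string, string>
    {
        { "N", "N" }, { "NORTH", "N" }, { "S", "S" }, { "SOUTH", "S" },
        { "E", "E" }, { "EAST", "E" }, { "W", "W" }, { "WEST", "W" },
        { "NW", "NW" }, { "NORTHWEST", "NW" }, { "NE", "NE" }, { "NORTHEAST", "NE" },
        { "SW", "SW" }, { "SOUTHWEST", "SW" }, { "SE", "SE" }, { "SOUTHEAST", "SE" }
    };
    static readonly Dictionary<string, string> Types = new Dictionary<string, string>
    {
        { "ST", "ST" }, { "STREET", "ST" }, { "AVE", "AVE" }, { "AVENUE", "AVE" },
        { "RD", "RD" }, { "ROAD", "RD" }
    };
    static readonly Dictionary<string, string> Ordinals = new Dictionary<string, string>
    {
        { "FIRST", "1ST" }, { "SECOND", "2ND" }, { "THIRD", "3RD" }
    };

    // Classifies each token as directional, type, or name, then reassembles
    // the parts in a fixed order: name, type, directional.
    public static string Canonicalize(string street)
    {
        string directional = "", type = "";
        var name = new List<string>();
        var separators = new[] { ' ', ',', '.' };

        foreach (var raw in street.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string token = raw.ToUpperInvariant();
            if (Directionals.TryGetValue(token, out var d)) directional = d;
            else if (Types.TryGetValue(token, out var t)) type = t;
            else name.Add(Ordinals.TryGetValue(token, out var o) ? o : token);
        }

        var parts = new[] { string.Join(" ", name), type, directional };
        return string.Join(" ", parts.Where(p => p.Length > 0));
    }
}
```

Both "Second Street NW" and "NW 2nd St" canonicalize to the same string here, so an exact comparison succeeds where the similarity metrics score poorly.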

Likewise, you can use Google's phone number parsing library (available for C#, Java, etc.) to format all phone numbers in a standard way and then do direct comparison.
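As a rough stand-in for what such a library does, the sketch below normalizes North American numbers using only the standard library. The `PhoneCanonicalizer` class is hypothetical and the "+1" plus ten digits output format is an assumption; a real parsing library handles international formats, extensions, and validity checks that this does not.

```csharp
using System;
using System.Linq;

public static class PhoneCanonicalizer
{
    // Returns "+1XXXXXXXXXX" for input that contains a 10-digit number
    // (optionally prefixed by the country code 1), or null otherwise.
    public static string Canonicalize(string phone)
    {
        string digits = new string(
            (phone ?? string.Empty).Where(c => c >= '0' && c <= '9').ToArray());

        if (digits.Length == 11 && digits[0] == '1') digits = digits.Substring(1);
        return digits.Length == 10 ? "+1" + digits : null;
    }

    // Direct comparison of the canonical forms; unparseable input never matches.
    public static bool IsSamePhone(string a, string b)
    {
        string x = Canonicalize(a), y = Canonicalize(b);
        return x != null && x == y;
    }
}
```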

I did use Jaro-Winkler to compare name components, but I did not research several of the metrics you have listed.
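For reference, a minimal Jaro-Winkler sketch is shown below. It is not the code I used: the `JaroWinkler` class name is hypothetical, the prefix scale `p = 0.1` and the four-character prefix cap are the commonly cited defaults, and callers are assumed to pass non-null strings that are already normalized for case.

```csharp
using System;

public static class JaroWinkler
{
    public static double Jaro(string s1, string s2)
    {
        if (s1.Length == 0 && s2.Length == 0) return 1.0;
        if (s1.Length == 0 || s2.Length == 0) return 0.0;

        // Characters count as matching only within this distance of each other.
        int window = Math.Max(0, Math.Max(s1.Length, s2.Length) / 2 - 1);
        var matched1 = new bool[s1.Length];
        var matched2 = new bool[s2.Length];
        int matches = 0;

        for (int i = 0; i < s1.Length; i++)
        {
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(s2.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
            {
                if (matched2[j] || s1[i] != s2[j]) continue;
                matched1[i] = matched2[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Walk both match sequences in order and count positions that disagree;
        // half of that count is the number of transpositions.
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1[i] != s2[k]) outOfOrder++;
            k++;
        }

        double m = matches, t = outOfOrder / 2.0;
        return (m / s1.Length + m / s2.Length + (m - t) / m) / 3.0;
    }

    // Boosts the Jaro score for strings sharing a common prefix (up to 4 chars).
    public static double Similarity(string s1, string s2, double p = 0.1)
    {
        double jaro = Jaro(s1, s2);
        int prefix = 0, max = Math.Min(4, Math.Min(s1.Length, s2.Length));
        while (prefix < max && s1[prefix] == s2[prefix]) prefix++;
        return jaro + prefix * p * (1.0 - jaro);
    }
}
```

The prefix boost is what makes this metric a reasonable fit for names, where typos tend to occur later in the string than the first few letters.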

In short:

Canonicalize and compare

instead of fuzzy match.

J. Dimeo