I want to compare pairs of data fields using fuzzy-matching algorithms for record linkage in C#, and I want to determine which algorithm works best for each comparison.
The fields I am looking to compare are:
- Last Name
- First Name
- Gender
- Birth Year
- Birth Month
- Birth Day
- SSN
- Member Number
- MRN
- Street Number
- Street Name
- Street Type
- Street Directional
- City
- State
- Zip
- Phone
The approximate string matching (ASM) algorithms I am currently using are:
- Levenshtein Distance
- Hamming Distance
- Jaccard Distance
- Jaro Distance
- Jaro-Winkler Distance
- Longest Common Subsequence
- Longest Common Substring
- Overlap Coefficient
- Ratcliff-Obershelp Similarity
- Sorensen-Dice Distance
- Tanimoto Coefficient
- Damerau-Levenshtein Distance
- Wagner-Fischer Distance
- Soundex
- Metaphone 3
- NYSIIS
First, I compare two fields, such as `FirstName1` and `FirstName2`, to see whether they are an exact match.
For example, `FirstName1 = "Bob"` and `FirstName2 = "Bob"` are an exact match, so the pair does not move on to fuzzy matching.
On the other hand, `FirstName1 = "Jill"` and `FirstName2 = "Bob"` are not an exact match, so the pair moves on to a fuzzy comparison.
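To make the flow concrete, here is a minimal sketch of the exact-match-then-fuzzy step. The `FieldComparer` class, its method names, and the choice of normalized Levenshtein distance as the fallback are assumptions for illustration only; any of the algorithms listed above could be swapped in for the fallback.

```csharp
using System;

// Hypothetical helper, not from any library: exact match first, fuzzy fallback second.
public static class FieldComparer
{
    // Dynamic-programming Levenshtein distance, keeping only two rows in memory.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            (prev, curr) = (curr, prev); // the row just filled becomes the previous row
        }
        return prev[b.Length];
    }

    // Returns 1.0 for an exact match; otherwise a similarity in [0, 1]
    // derived from the Levenshtein distance (assumed fallback).
    public static double Compare(string value1, string value2)
    {
        value1 = value1 ?? string.Empty;
        value2 = value2 ?? string.Empty;

        // Step 1: exact (ordinal, case-sensitive) match skips fuzzy matching.
        if (string.Equals(value1, value2, StringComparison.Ordinal))
            return 1.0;

        // Step 2: fuzzy comparison, normalized by the longer string's length.
        int maxLen = Math.Max(value1.Length, value2.Length);
        return 1.0 - (double)Levenshtein(value1, value2) / maxLen;
    }
}
```

With this sketch, `FieldComparer.Compare("Bob", "Bob")` returns 1.0 without running the fuzzy step, `FieldComparer.Compare("Jill", "Bob")` returns 0.0, and `FieldComparer.Compare("Jon", "John")` returns 0.75.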
Does anyone know which fuzzy-matching algorithms are better suited to which of these fields, and which should be avoided for particular fields?