Records from two datasets are compared for fuzzy string similarity using a normalized Levenshtein distance function and a trigram similarity function. Four similarity metrics are calculated: LevCmpSimilarity, the normalized Levenshtein similarity of the composite (concatenated) fields; LevWghSimilarity, the normalized Levenshtein similarity summarized over all the individual fields being compared; and TrgCmp and TrgWgh, the same as the two Levenshtein metrics but computed with the trigram similarity function instead of Levenshtein.
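To make the four metrics concrete, here is a minimal Python sketch of how they could be computed. It is only an illustration under assumptions, not the functions I actually ran: the helper names (`lev_similarity`, `trg_similarity`, `composite_similarity`, `weighted_similarity`) are hypothetical, the distance is normalized by the length of the longer string, the trigram padding treats the whole string as one word, and the per-field "summary" is taken to be a weighted average.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lev_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; assumes division by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(s: str) -> set:
    """Character trigrams; assumes two leading and one trailing pad space."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trg_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard ratio)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def composite_similarity(rec_a, rec_b, sim) -> float:
    """The 'Cmp' variant: compare the concatenated fields as one string."""
    return sim(" ".join(rec_a), " ".join(rec_b))


def weighted_similarity(rec_a, rec_b, weights, sim) -> float:
    """The 'Wgh' variant: assumed here to be a weighted average per field."""
    total = sum(weights)
    return sum(w * sim(x, y) for w, x, y in zip(weights, rec_a, rec_b)) / total
```

For example, `weighted_similarity(("john", "smith"), ("john", "smyth"), [1, 1], lev_similarity)` compares the records field by field, while `composite_similarity(("john", "smith"), ("john", "smyth"), trg_similarity)` compares the joined strings.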

Below are histograms of the absolute and cumulative frequencies for all four metrics.

[Figures: absolute frequencies; cumulative frequencies]
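For reference, a minimal sketch of how the absolute and cumulative frequencies behind such histograms could be computed. It assumes similarity values in [0, 1] and equal-width bins; the function name `histogram` and the default of 10 bins are hypothetical choices for illustration, not necessarily what produced the plots.

```python
def histogram(values, bins=10):
    """Return (absolute counts, cumulative counts) over equal-width bins on [0, 1]."""
    counts = [0] * bins
    for v in values:
        # A similarity of exactly 1.0 is placed in the last bin.
        index = min(int(v * bins), bins - 1)
        counts[index] += 1
    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running)
    return counts, cumulative
```

Calling `histogram(scores)` on the list of similarity values for one metric gives the two series plotted for that metric.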

My question is: could the pattern of these frequency distributions be used for automatic, unsupervised determination of optimum threshold values for accepting or rejecting record matches? If the answer is yes, can you suggest a direction?

Basically, could the frequency patterns of the Levenshtein and trigram similarity values alone be used to infer optimum threshold values for fuzzy-match record linkage?

zlatko
  • Later on I created a PL/pgSQL function that combines the Levenshtein and trigram similarity functions into a single similarity function. This reduced the complexity, but the main question remains: how to determine the optimum similarity threshold from the frequency distribution (histogram)? Is it even possible? – zlatko May 07 '16 at 06:40

0 Answers