Records from two datasets are compared for fuzzy string similarity, using a normalized Levenshtein distance function and a trigram similarity function. Four similarity metrics are calculated: LevCmpSimilarity, the normalized Levenshtein similarity of the composite (concatenated) fields; LevWghSimilarity, the normalized Levenshtein similarity summarized over all individual fields being compared; and TrgCmp and TrgWgh, the same two metrics computed with the trigram similarity function instead of Levenshtein.
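To make the four metrics concrete, here is a minimal Python sketch of how they might be computed. Several details are my assumptions and not necessarily what the actual implementation does: the Levenshtein distance is normalized by the length of the longer string, the trigram function pads the whole lowercased string with two leading spaces and one trailing space and uses Jaccard overlap (a simplification loosely modeled on PostgreSQL's pg_trgm), the composite variant joins fields with a single space, and the "Wgh" variant is a weighted average over fields with equal weights by default. All function names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lev_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(s: str) -> set:
    """Character trigrams of the padded, lowercased string (assumed padding)."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


Sim = Callable[[str, str], float]


def composite_similarity(rec_a: Sequence[str], rec_b: Sequence[str], sim: Sim) -> float:
    """'Cmp' variant: compare the concatenated fields as one string."""
    return sim(" ".join(rec_a), " ".join(rec_b))


def weighted_similarity(rec_a: Sequence[str], rec_b: Sequence[str], sim: Sim,
                        weights: Optional[Sequence[float]] = None) -> float:
    """'Wgh' variant: weighted average of per-field similarities (assumed)."""
    if weights is None:
        weights = [1.0] * len(rec_a)  # equal weights unless specified
    total = sum(weights)
    return sum(w * sim(x, y) for w, x, y in zip(weights, rec_a, rec_b)) / total
```

With these helpers, `composite_similarity(r1, r2, lev_similarity)` would play the role of LevCmpSimilarity and `weighted_similarity(r1, r2, trigram_similarity)` the role of TrgWgh, where `r1` and `r2` are tuples of field values.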
Below are histograms of the absolute and cumulative frequencies for all four metrics.
[Figure: absolute frequencies] [Figure: cumulative frequencies]
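For reference, the absolute and cumulative frequencies shown in the histograms can be computed along these lines. This is only a sketch: it assumes every score lies in [0, 1] and uses equal-width bins, and the function name `frequency_histogram` and the default of 10 bins are my own choices for illustration.

```python
from typing import List, Sequence, Tuple


def frequency_histogram(scores: Sequence[float], bins: int = 10) -> Tuple[List[int], List[int]]:
    """Return (absolute, cumulative) bin counts for scores assumed to lie in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(s * bins), bins - 1)
        counts[idx] += 1

    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running)
    return counts, cumulative
```

Calling it once per metric (LevCmpSimilarity, LevWghSimilarity, TrgCmp, TrgWgh) over all compared record pairs gives the two series plotted for that metric.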
My question is: could the pattern of these frequency distributions be used for automatic, unsupervised determination of optimum threshold values for accepting or rejecting record matches? If the answer is yes, can you suggest a direction?
Basically, could the frequency patterns of Levenshtein distance and trigram similarity values alone be used to infer optimum threshold values for fuzzy-match record linkage?