I have written a program that looks for probable duplicate records in a list of records. The current version recognizes 94 percent of the duplicates found by a manual check of the data, but I need to get that as close to 100 percent as possible while minimizing false positives, which has proved tricky.
My program takes as input a spreadsheet with four relevant columns, each of length n: the record name followed by three string attributes. For each attribute column, I compute the Levenshtein distance between every pair of elements, forming an n×n matrix of distances. I then weed out insufficiently similar pairs by keeping only those whose distance is below a chosen threshold (the threshold differs per column because some attributes have more variance across their instances than others). Two records are then said to be probable duplicates if at least one of their corresponding attribute pairs passes its column's threshold test.
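To make that concrete, the per-column step looks roughly like this (a simplified Python sketch; the real values and thresholds come from the spreadsheet):

    def levenshtein(a, b):
        """Plain dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def attribute_matches(values, threshold):
        """Index pairs (i, j), i < j, from one attribute column whose
        Levenshtein distance is below that column's threshold."""
        n = len(values)
        dist = [[levenshtein(values[i], values[j]) for j in range(n)] for i in range(n)]
        return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] < threshold}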
Sample output: Kang,[Kor, Koloth, Martok]
where the record Kang is a probable duplicate of Kor, Koloth, and Martok because at least one of their associated fields is similar to the corresponding field of Kang,
e.g. bat'leth,[mek'leth, d'k tahg, mek'leth]
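Putting the pieces together, the record-level merge that produces output like the above is essentially a union over the three attribute columns (again a sketch, reusing attribute_matches from above; the thresholds in the comment are made up):

    from collections import defaultdict

    def probable_duplicates(names, attribute_columns, thresholds):
        """Flag two records as probable duplicates if at least one of their
        corresponding attribute pairs passed its column's threshold test."""
        flagged = set()
        for values, threshold in zip(attribute_columns, thresholds):
            flagged |= attribute_matches(values, threshold)
        result = defaultdict(list)
        for i, j in sorted(flagged):
            result[names[i]].append(names[j])
            result[names[j]].append(names[i])
        return result

    # e.g. probable_duplicates(["Kang", "Kor", "Koloth", "Martok"], columns, [4, 3, 5])
    # might yield {"Kang": ["Kor", "Koloth", "Martok"], ...} depending on the data.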
I could easily tweak the Levenshtein distance thresholds to detect a higher proportion of the duplicates, but raising them beyond their current values seems to yield diminishing marginal recognitions while dramatically increasing the number of false positives, since a small number of duplicates have attributes that are very dissimilar (I'm not sure I can post examples; the data is work sensitive). I have considered alternative solutions such as using other string similarity functions besides Levenshtein's, or even a Naïve Bayes classifier. How do you think I could best improve the accuracy of my program?
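For concreteness, the Naïve Bayes variant I have in mind would treat the per-attribute distances of a candidate pair as a feature vector and train on pairs labeled by the manual check, roughly like this (a sketch assuming scikit-learn and reusing the levenshtein helper above; the feature choice is only a guess at this point):

    from sklearn.naive_bayes import GaussianNB

    def pair_features(pair, attribute_columns):
        """Feature vector for a candidate pair: its per-attribute distances."""
        i, j = pair
        return [levenshtein(col[i], col[j]) for col in attribute_columns]

    def train_duplicate_classifier(labeled_pairs, attribute_columns):
        """labeled_pairs: ((i, j), is_duplicate) tuples taken from the manual check."""
        X = [pair_features(pair, attribute_columns) for pair, _ in labeled_pairs]
        y = [int(is_dup) for _, is_dup in labeled_pairs]
        return GaussianNB().fit(X, y)

    # clf = train_duplicate_classifier(labeled_pairs, attribute_columns)
    # clf.predict([pair_features((i, j), attribute_columns)])  # 1 means probable duplicate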