Background: I have a list that contains 13,000 records of human names, some of them are duplicates and I want to find out the similar ones to do the manual duplication process.
For an array like:
["jeff","Jeff","mandy","king","queen"]
What would be an efficient way to get:
[["jeff","Jeff"]]
Explanation ["jeff","Jeff"]
since their Levenshtein distance is 1(which can be variable like 3).
/*
Working but a slow solution
*/
function extractSimilarNames(uniqueNames) {
let similarNamesGroup = [];
for (let i = 0; i < uniqueNames.length; i++) {
//compare with the rest of the array
const currentName = uniqueNames[i];
let suspiciousNames = [];
for (let j = i + 1; j < uniqueNames.length; j++) {
const matchingName = uniqueNames[j];
if (isInLevenshteinRange(currentName, matchingName, 1)) {
suspiciousNames.push(matchingName);
removeElementFromArray(uniqueNames, matchingName);
removeElementFromArray(uniqueNames, currentName);
i--;
j--;
}
}
if (suspiciousNames.length > 0) {
suspiciousNames.push(currentName);
}
}
return similarNamesGroup;
}
I want to find the similarity via Levenshtein distance, not only the lower/uppercase similarity
I already find one of the fastest Levenshtein implementation but it still takes me to 35 mins to get the result of 13000 items list.