
Background: I have a list of 13,000 records of human names. Some of them are duplicates, and I want to find the similar ones so I can do the deduplication manually.

For an array like:

["jeff","Jeff","mandy","king","queen"] 

What would be an efficient way to get:

[["jeff","Jeff"]]

Explanation: ["jeff","Jeff"] are grouped because their Levenshtein distance is 1 (the threshold can be variable, e.g. 3).

/* 
Working but a slow solution
*/
function extractSimilarNames(uniqueNames) {
  let similarNamesGroup = [];

  for (let i = 0; i < uniqueNames.length; i++) {
    //compare with the rest of the array
    const currentName = uniqueNames[i];

    let suspiciousNames = [];

    for (let j = i + 1; j < uniqueNames.length; j++) {
      const matchingName = uniqueNames[j];
      if (isInLevenshteinRange(currentName, matchingName, 1)) {
        suspiciousNames.push(matchingName);
        removeElementFromArray(uniqueNames, matchingName);
        removeElementFromArray(uniqueNames, currentName);
        i--;
        j--;
      }
    }
    if (suspiciousNames.length > 0) {
      suspiciousNames.push(currentName);
      similarNamesGroup.push(suspiciousNames); // collect the group (this push was missing)
    }
  }
  return similarNamesGroup;
}

I want to find similarity via Levenshtein distance, not only lowercase/uppercase similarity.

I have already found one of the fastest Levenshtein implementations, but it still takes 35 minutes to get the result for the 13,000-item list.
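
(For reference, the isInLevenshteinRange and removeElementFromArray helpers are not shown in the question. Below is a minimal sketch of what a thresholded Levenshtein check could look like: a standard dynamic-programming implementation with a length pre-check and an early exit once a whole row exceeds the threshold, which is what makes a bounded check cheaper than computing the full distance.)

// Sketch of a thresholded Levenshtein check (one possible shape of the
// helper the question refers to; the actual implementation is not shown)
function isInLevenshteinRange(a, b, range) {
  // Cheap pre-check: the distance is at least the length difference
  if (Math.abs(a.length - b.length) > range) return false;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
      rowMin = Math.min(rowMin, curr[j]);
    }
    // Row minima never decrease, so once a whole row is over the
    // threshold the final distance must be too
    if (rowMin > range) return false;
    prev = curr;
  }
  return prev[b.length] <= range;
}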

Jeff Chung
  • If you just want to remove the duplicates, can you use a data structure such as Set? By using Set, you could possibly do it in linear time complexity. The algorithm you are using now is O(n^2). – royalghost Apr 23 '19 at 04:39
  • Is this just lower/upper case similarity? You are using a distance metric which is much more general. For example *jeff* and *peff* are also Levenshtein distance 1. – kabanus Apr 23 '19 at 04:42
  • Possible duplicate of [Sort an array by the "Levenshtein Distance" with best performance in Javascript](https://stackoverflow.com/questions/11919065/sort-an-array-by-the-levenshtein-distance-with-best-performance-in-javascript) – Kevin Kopf Apr 23 '19 at 05:05
  • Possible duplicate of [Damerau-Levenshtein distance Implementation](https://stackoverflow.com/questions/22308014/damerau-levenshtein-distance-implementation) – Kevin Kopf Apr 23 '19 at 05:11
  • @alex I don't think that a Levenshtein implementation that compares two strings would help here. – Jonas Wilms Apr 23 '19 at 06:19
  • I suspect that the `removeElementFromArray` function is killing your performance because it mutates the array that you're traversing. Remove the 4 lines after `suspiciousNames.push(matchingName);` and test the performance using `console.time` and `console.timeEnd`, preferably on a smaller array to begin with. – Aadit M Shah Apr 23 '19 at 06:54
  • Related https://cs.stackexchange.com/questions/53299/find-all-pairs-of-strings-in-a-set-with-levenshtein-distance-d – גלעד ברקן Apr 23 '19 at 10:27
  • What is the expected output for `["Jeff", "eff", "effl"]`? Also, are you only interested in a Levenshtein distance of 1 or could it be variable? – גלעד ברקן Apr 23 '19 at 10:45

5 Answers


Your problem is not the speed of the Levenshtein distance implementation. Your problem is that you have to compare each word with each other word. This means you make on the order of 13,000² (roughly 169 million) comparisons, and each time calculate a Levenshtein distance.

So my approach would be to try to reduce the number of comparisons.

Here are some ideas:

  • words are only similar if their lengths differ by less than 20% (just my estimation)
    → we can group by length and only compare words with other words of length ±20%

  • words are only similar if they share a lot of letters
    → we can create a list of e.g. 3-grams (all lower case) that refer to the words they are part of
    → only compare (e.g. with Levenshtein distance) a word with other words that have several 3-grams in common with it (see the sketch below)
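
A minimal sketch combining both ideas, reusing the isInLevenshteinRange helper from the question. For simplicity it requires only a single shared 3-gram rather than several; note that very short names may share no 3-gram even at distance 1, so this trades some recall for speed:

// Sketch: index words by lowercase 3-grams, then only run the expensive
// distance check on words sharing a 3-gram whose lengths are close enough
function findCandidatePairs(words, maxDistance = 1) {
  const byTrigram = new Map(); // trigram -> Set of word indices
  words.forEach((word, idx) => {
    const w = word.toLowerCase();
    for (let i = 0; i + 3 <= w.length; i++) {
      const tri = w.slice(i, i + 3);
      if (!byTrigram.has(tri)) byTrigram.set(tri, new Set());
      byTrigram.get(tri).add(idx);
    }
  });

  const pairs = new Set(); // JSON-encoded sorted pairs, for deduplication
  for (const indexSet of byTrigram.values()) {
    const indices = [...indexSet];
    for (let a = 0; a < indices.length; a++) {
      for (let b = a + 1; b < indices.length; b++) {
        const [i, j] = [indices[a], indices[b]];
        // Length pre-filter: distance is at least the length difference
        if (Math.abs(words[i].length - words[j].length) > maxDistance) continue;
        if (isInLevenshteinRange(words[i], words[j], maxDistance))
          pairs.add(JSON.stringify([words[i], words[j]].sort()));
      }
    }
  }
  return [...pairs].map(JSON.parse);
}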

MrSmith42
  • The main problem: if we have 10 words with lengths [1,2,3,4,5,6,7,8,9,10], how can we cluster them? With any clustering method there will be 2 clusters containing words whose lengths differ only by 1, and in that case the only way is to keep everything in one group. @MrSmith42, your idea is good, have my vote, but it is not really implementable without data loss. – Gor Apr 23 '19 at 08:06
  • "20%" is quite a random guess; why not just say that only words with a length ±1 get compared? That would be accurate according to the question. – Jonas Wilms Apr 23 '19 at 08:43
  • @Jonas Wilms: 20% is just what I would probably use. The Levenshtein distance will be at least as big as the length difference, so you can use the maximum tolerable Levenshtein distance for your case as guidance for how to set this maximum acceptable length difference. – MrSmith42 Apr 23 '19 at 15:45

Approaches to remove similar names:

  1. Use a phonetic representation of the words, e.g. cmudict (it works with Python NLTK). You can find which names are close to each other phonetically.
  2. Try different forms of stemming or simplification. I would try the most aggressive stemmers, like the Porter stemmer.
  3. Levenshtein trie. You can create a trie data structure that helps find the word with minimum distance to a searched item; this is used for full-text search in some search engines. As far as I know it's already implemented in Java. In your case you need to search for one item and then add it to the structure at every step, making sure the item you search for is not in the structure yet.
  4. Manual naive approach. Find all suitable representations of every word/name, put all representations into a map, and find representations that have more than one word (see the sketch after this list). If you have around 15 different representations of one word, you will only need 280K iterations to generate this object (much faster than comparing each word to every other, which requires around 80M comparisons with 13K names).
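
A rough sketch of approach 4. Which representations to use is left open in the answer, so as an assumption this sketch uses the lowercase name plus each one-character deletion, which makes names at case-insensitive Levenshtein distance 1 share at least one representation. Groups can still contain distance-2 false positives (e.g. transpositions), so a final Levenshtein check per group can weed those out:

// Hypothetical representation choice: lowercase name + one-char deletions
function representationsOf(name) {
  const w = name.toLowerCase();
  const reps = [w];
  for (let i = 0; i < w.length; i++) reps.push(w.slice(0, i) + w.slice(i + 1));
  return reps;
}

function groupByRepresentation(names) {
  const buckets = new Map(); // representation -> Set of original names
  for (const name of names) {
    for (const rep of representationsOf(name)) {
      if (!buckets.has(rep)) buckets.set(rep, new Set());
      buckets.get(rep).add(name);
    }
  }
  // Representations shared by more than one distinct name are suspects
  return [...buckets.values()].filter(s => s.size > 1).map(s => [...s]);
}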

-- Edit --

If there were a choice, I would use something like Python or Java instead of JS for this. It's only my opinion, based on: I don't know all the requirements, it's common to use Java/Python for natural language processing, and the task looks more like heavy data processing than front end.

varela
  • I like this answer except the first sentence. Even if I also dislike JavaScript, there is no real reason not to use it to implement the algorithm. I'm sure you can find implementations of most of the mentioned algorithms for JavaScript too (or they could quite easily be translated to JavaScript). – MrSmith42 Apr 23 '19 at 07:42
  • @MrSmith42 I understand your concern, and that there could be other reasons to use JavaScript; I've come to like it too. But I prefer to keep JavaScript as a front-end language, and a task that requires heavy data processing doesn't look like front end, so there will be fewer ready-made solutions for this case. It's common to use Java and Python for natural language processing, and if you have the choice, why not. – varela Apr 23 '19 at 07:50
  • This answer could be good ... if it did not contain all those references to other languages but rather focused on concrete algorithms (which work in any language). – Jonas Wilms Apr 23 '19 at 08:41
  • @JonasWilms sorry for this, I'm starting to feel like a JavaScript racist – varela Apr 23 '19 at 09:34

Since your working code uses a Levenshtein distance of 1 only, I will assume no other distances need to be found.

I will propose a solution similar to the one Jonas Wilms posted, with these differences:

  • No need to call an isLevenshtein function
  • Produces only unique pairs
  • Each pair is lexically ordered

// Sample data with lots of similar names
const names = ["Adela","Adelaida","Adelaide","Adele","Adelia","AdeLina","Adeline",
               "Adell","AdellA","Adelle","Ardelia","Ardell","Ardella","Ardelle",
               "Ardis","Madeline","Odelia","ODELL","Odessa","Odette"];

const map = {};
const pairs = new Set;
for (const name of names) {
    for (const i in name+"_") { // Additional iteration to NOT delete a character
        const key = (name.slice(0, i) + name.slice(+i + 1, name.length)).toLowerCase();
        // Group words together where the removal from the same index leads to the same key
        if (!map[key]) map[key] = Array.from({length: key.length+1}, () => new Set);
        // If NO character was removed, put the word in EACH group
        for (const set of (+i < name.length ? [map[key][i]] : map[key])) {
            if (set.has(name)) continue;
            for (let similar of set) pairs.add(JSON.stringify([similar, name].sort()));
            set.add(name);
        }
    }
}
const result = [...pairs].sort().map(JSON.parse); // sort is optional
console.log(result);

I tested this on a set of 13000 names, including at least 4000 different names, and it produced 8000 pairs in about 0.3 seconds.

trincot

If we remove one character from "Jeff" at different positions, we end up with "eff", "Jff", "Jef" and "Jef". If we do the same with "jeff", we get "eff", "jff", "jef" and "jef". Now if you look closely, you'll see that both strings produce "eff" as a result. This means we could create a Map from those combinations to their original versions, then for each string generate all combinations and look them up in the Map. Through the lookup you'll get results that are similar, e.g. "abc" and "cab", but they do not necessarily have a Levenshtein distance of 1, so we have to check that afterwards.

Now why is that better?

Well, iterating all names is O(n) (n being the number of words), creating all combinations is O(m) (m being the average number of characters in a word) and looking up in a Map is O(1), therefore this runs in O(n * m), whereas your algorithm is O(n * n * m), which means for 10,000 words mine is 10,000 times faster (or my calculation is wrong :))

  // A "OneToMany" Map
  class MultiMap extends Map {
    set(k, v) {
      if(super.has(k)) {
        super.get(k).push(v);
       } else super.set(k, [v]);
     }
     get(k) {
        return super.get(k) || [];
     }
  }

  function* oneShorter(word) {
    for(let pos = 0; pos < word.length; pos++)
       yield word.substr(0, pos) + word.substr(pos + 1);
  }

  function findDuplicates(names) {
    const combos = new MultiMap();
    const duplicates = [];

    const check = (name, combo) => {
      const dupes = combos.get(combo);
      for(const dupe of dupes) {
         if((isInLevenshteinRange(name, combo, 1))
         duplicates.push([name, dupe]);
      }
      combos.set(combo, name);
    };

    for(const name of names) {
      check(name, name);

      for(const combo of oneShorter(name)) {
         check(name, combo);
      }
    }

     return duplicates;
 }
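
With those fixes in place, a quick sanity check on the question's sample data (this assumes the isInLevenshteinRange helper from the question is in scope):

const names = ["jeff", "Jeff", "mandy", "king", "queen"];
console.log(findDuplicates(names)); // → [ [ "Jeff", "jeff" ] ]
// "Jeff" and "jeff" collide through their shared deletion combo "eff"
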
Jonas Wilms
  • But fails for `fn("Jeff ", "Jff")` – Kaiido Apr 23 '19 at 06:35
  • @kaiido what if I repeated that and omitted one character instead of replacing it with an underscore? – Jonas Wilms Apr 23 '19 at 06:39
  • You would still have no matches between "Jeff" and "Jff" at index 1. Also note that OP used index 1 in their example, but they may also want to leverage the filter a bit and go as deep as level 2 or 3. Doing so, I'm not sure your algorithm would perform any better than a true Levenshtein – Kaiido Apr 23 '19 at 06:43
  • @kaiido It would if I add the name itself to the map. Removing one from *Jeff* results in "eff", "Jff", "Jef". And yes, this is probably only better for small Levenshtein distances – Jonas Wilms Apr 23 '19 at 06:45

I have yet a completely different way of approaching this problem, one that I believe is pretty fast (but debatable as to how correct/incorrect it is). My approach is to map the strings to numeric values, sort those values once, and then run through that list once, comparing neighboring values to each other. Like this:

// Test strings (provided by OP) with some additions
var strs = ["Jeff","mandy","jeff","king","queen","joff", "Queen", "jff", "tim", "Timmo", "Tom", "Rob", "Bob"] 

// Function to convert a string into a numeric representation
// to aid with string similarity comparison
function atoi(str, maxLen){
  var i = 0;
  for( var j = 0; j < maxLen; j++ ){
    if( str[j] != null ){
      i += str.toLowerCase().charCodeAt(j)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    } else {
      // Normalize the string with a pad char
      // up to the maxLen (update the value, but don't actually
      // update the string...)
      i += '-'.charCodeAt(0)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    }
  }
  valMap.push({
     str,
     i 
  })
  return i;
}

Number.prototype.inRange = function(min, max){ return(this >= min && this <= max) }

var valMap = []; // Array of string-value pairs

var maxLen = strs.map((s) => s.length).sort((a, b) => a - b).pop() // maxLen of all strings (numeric sort; the default sort compares lexically)
console.log('maxLen', maxLen)
strs.forEach((s) => atoi(s, maxLen)) // Map strings to values

var similars = [];
var subArr = []
var margin = 0.05;
valMap.sort((a,b) => a.i > b.i ? 1 : -1) // Sort the map...
valMap.forEach((entry, idx) => {  
  if( idx > 0 ){
      var closeness = Math.abs(entry.i / valMap[idx-1].i);
      if( closeness.inRange( 1 - margin, 1 + margin ) ){
        if( subArr.length == 0 ) subArr.push(valMap[idx-1].str)
        subArr.push(entry.str)
        if( idx == valMap.length - 1){
          similars.push(subArr)
        }
      } else {
        if( subArr.length > 0 ) similars.push(subArr)
        subArr = []
      }
  }
})
console.log('similars', similars)

I'm treating each string as if it were a base-64 number, where each "digit" can take on the alphanumeric values, with 'a' representing 0. Then I sort that once. Then, if a value similar to the previous one is encountered (i.e., if the ratio of the two is near 1), I deduce I have similar strings.

The other thing I do is check for the max string length, and normalize all the strings to that length in the calculation of the base-64 value.
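
To make the mapping concrete, here is the value the atoi above computes for "jeff", taking maxLen = 4 for illustration (in the snippet, maxLen is the longest name's length and shorter strings are padded with '-'):

// atoi("jeff", 4): each character contributes (charCode - 'a') * 64^(maxLen - j)
//   j=0: ('j' - 'a') * 64^4 = 9 * 16,777,216 = 150,994,944
//   j=1: ('e' - 'a') * 64^3 = 4 *    262,144 =   1,048,576
//   j=2: ('f' - 'a') * 64^2 = 5 *      4,096 =      20,480
//   j=3: ('f' - 'a') * 64^1 = 5 *         64 =         320
//   total:                                     152,064,320
// "Jeff" is lowercased first, so it maps to exactly the same value:
// the ratio is 1, well within the 0.05 margin, and the pair is grouped.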

--- EDIT: even more stress testing ---

Here is some additional testing, which pulls a large list of names and performs the processing rather quickly (~50 ms on 20k+ names, with lots of false positives). Regardless, this snippet should make it easier to troubleshoot:

var valMap = []; // Array of string-value pairs

/* Extensions */
Number.prototype.inRange = function(min, max){ return(this >= min && this <= max) }

/* Methods */
// Function to convert a string into a numeric representation
// to aid with string similarity comparison
function atoi(str, maxLen){
  var i = 0;
  for( var j = 0; j < maxLen; j++ ){
    if( str[j] != null ){
      i += str.toLowerCase().charCodeAt(j)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    } else {
      // Normalize the string with a pad char
      // up to the maxLen (update the value, but don't actually
      // update the string...)
      i += '-'.charCodeAt(0)*Math.pow(64,maxLen-j) - 'a'.charCodeAt(0)*Math.pow(64,maxLen-j)
    }
  }
  valMap.push({ str, i })
  return i;
}

function findSimilars(strs){
  var maxLen = strs.map((s) => s.length).sort((a, b) => a - b).pop() // maxLen of all strings (numeric sort; the default sort compares lexically)
  console.log('maxLen', maxLen)
  strs.forEach((s) => atoi(s, maxLen)) // Map strings to values

  var similars = [];
  var subArr = []
  var margin = 0.05;
  valMap.sort((a,b) => a.i > b.i ? 1 : -1) // Sort the map...
  valMap.forEach((entry, idx) => {  
    if( idx > 0 ){
        var closeness = Math.abs(entry.i / valMap[idx-1].i);
        if( closeness.inRange( 1 - margin, 1 + margin ) ){
          if( subArr.length == 0 ) subArr.push(valMap[idx-1].str)
          subArr.push(entry.str)
          if( idx == valMap.length - 1){
            similars.push(subArr)
          }
        } else {
          if( subArr.length > 0 ) similars.push(subArr)
          subArr = []
        }
    }
  })
  console.log('similars', similars)
}

// Stress test with 20k+ names 
$.get('https://raw.githubusercontent.com/dominictarr/random-name/master/names.json')
.then((resp) => {
  var strs = JSON.parse(resp);
  console.time('processing')
  findSimilars(strs)
  console.timeEnd('processing')
})
.catch((err) => { console.error('Err retrieving JSON'); })
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

(For some reason, when I run this in JSFiddle, I get it to run in ~50ms, but in the Stackoverflow snippet, it's closer to 1000ms.)

RichS
  • Very good, but that makes the first characters much more important than the last, so sibling items will only ever match neighbors in an alphabetically sorted list. – varela Apr 23 '19 at 09:48
  • No, this won't work. "abc" and "zbc" are far away in your sorting, but the Levenshtein distance is 1. – Jonas Wilms Apr 23 '19 at 15:48
  • This is a particularly interesting question in the way it's worded. If the strings are to be considered *names*, then my question would be: do we want "abc" and "zbc" to be grouped as similar? Likewise, if we have "Mo" or "Bo"? Or what if we have really close names like "Tim" and "Tom"? Very similar *strings* but completely different *names*. Or "Rob" vs "Bob"? The English language is very tricky sometimes. – RichS Apr 23 '19 at 18:57
  • I acknowledge in my answer that it can be argued how accurate my results will be, and it seems like there will ultimately be a trade-off between the speed of the algorithm and its accuracy. If we're talking about *names* as a special set of strings, then this is the reason why I weight the characters the way I do. – RichS Apr 23 '19 at 18:58