3

I have a problem where I need to do a fuzzy lookup in a hash map, i.e. return the value corresponding to that key that most closely resembles the query, in my case measured by the Levenshtein distance.

My current approach is to subclass dict with a special lookup method that computes the Levenshtein distance against all keys, then returns the value of the key with the lowest score. Basically this:

import Levenshtein

class FuzzyLookupDict(dict):

    def fuzzy_lookup(self, query):
        levs = [(key, Levenshtein.ratio(query, key)) for key in self.keys()]
        key, score = max(levs, key=lambda lev: lev[1])
        return self.get(key)

Is this a good approach or is there a better solution that I haven't thought of?

gmolau
  • 2,815
  • 1
  • 22
  • 45
  • Unless you can figure out a clever way to index your keys in an extra table, I don't think it is possible to do this without searching through all your keys. – Beefster Dec 28 '17 at 22:33

1 Answers1

5

This problem is normally solved with Levenshtein automata. A Levenshtein automaton for a string w and a number n is a finite state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n.

This algorithm is significantly faster than using dynamic programming to compute the Levenshtein distance separately for each dictionary word.

Jule Jacob's blog post Levenshtein automata can be simple and fast is a good starting point and Nick Johnsonz's Damn Cool Algorithms: Levenshtein Automata is a more in depth intro.

You can find some Python implementations on Github, for example https://github.com/antoinewdg/pyffs.