
I am trying to use k-nearest neighbors for a string similarity problem: given a string and a knowledge base, I want to output the k strings most similar to the given string. Are there any tutorials that explain how to use kd-trees to do this k-nearest-neighbor lookup efficiently for strings? The strings will not exceed 20 characters in length.

0x90
Legend
  • What's your similarity metric between two strings? [scipy.spatial.cKDTree](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html) is fast and solid, good for 20d, but supports only Lp metrics. – denis Apr 19 '11 at 15:25

1 Answer


Probably one of the hottest blog posts I read a year or so ago: Levenshtein Automata. Take a look at that article. It provides not only a description of the algorithm but also code to follow. Technically, it's not a kd-tree, but it's closely related to the string matching and dictionary correction algorithms one might encounter or use in the real world.
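As a rough illustration of bounded edit-distance lookup against a dictionary, here is a minimal sketch. Note that this is not the precomputed Levenshtein automaton from the article; it swaps in a simpler, related technique: walking a trie while carrying one row of the edit-distance table, and pruning branches whose row can no longer stay within the bound. The names `build_trie` and `fuzzy_search` are made up for this example.

```python
def build_trie(words):
    """Build a dict-based trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def fuzzy_search(trie, query, max_dist):
    """Return sorted (distance, word) pairs with edit distance <= max_dist."""
    results = []
    n = len(query)

    # The empty word (if stored) is len(query) insertions away.
    if None in trie and n <= max_dist:
        results.append((n, trie[None]))

    def walk(node, ch, prev_row):
        # Compute the next row of the edit-distance table for this trie edge.
        row = [prev_row[0] + 1]
        for i in range(1, n + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,        # insertion
                           prev_row[i] + 1,       # deletion
                           prev_row[i - 1] + cost))  # substitution or match
        if None in node and row[-1] <= max_dist:
            results.append((row[-1], node[None]))
        # Prune: if every cell exceeds the bound, no extension can match.
        if min(row) <= max_dist:
            for key, child in node.items():
                if key is not None:
                    walk(child, key, row)

    first_row = list(range(n + 1))
    for key, child in trie.items():
        if key is not None:
            walk(child, key, first_row)
    return sorted(results)
```

For example, `fuzzy_search(build_trie(words), "bool", 1)` returns every stored word within one edit of "bool", closest first. With strings capped at 20 characters, the recursion depth stays small.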

The same author also has a blog post about BK-trees, which are much better suited to fuzzy string matching and to lookups where there are misspellings. Here is another resource containing source code for a BK-tree (I can't vouch for the accuracy or correctness of this implementation).
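To make the BK-tree idea concrete for the k-nearest case, here is a minimal sketch, assuming Levenshtein distance as the metric. It is not the code from either linked resource; the class `BKTree` and its methods `add` and `nearest` are names invented for this example. Each node stores a word and its children keyed by their distance to that word, and the triangle inequality lets the search skip subtrees that cannot contain a closer match.

```python
import heapq


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None  # a node is a (word, {edge_distance: child}) pair

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def nearest(self, query, k):
        """Return up to k (distance, word) pairs, closest first."""
        if k <= 0 or self.root is None:
            return []
        best = []  # max-heap via negated distances: (-distance, word)
        stack = [self.root]
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if len(best) < k:
                heapq.heappush(best, (-d, word))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, word))
            # Search radius: distance of the current k-th best candidate.
            radius = -best[0][0] if len(best) == k else float("inf")
            for edge, child in children.items():
                # Triangle inequality: words in this subtree are at least
                # |d - edge| away from the query.
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted((-neg_d, w) for neg_d, w in best)
```

Usage would look like building the tree once from the knowledge base with repeated `add` calls, then calling `tree.nearest("bool", 2)`. Ties at the k-th distance are broken arbitrarily in this sketch.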

wheaties
  • The Levenshtein automaton is impressive. However, having implemented it, I can only say that the precomputed version quickly explodes (in terms of nodes) as the distance grows. In practice, it's blazing fast to search in a trie, but the automaton starts becoming really big for a distance of 4 and upwards. – Matthieu M. Apr 18 '11 at 16:39
  • @Matthieu M. what would you recommend instead? – wheaties Apr 18 '11 at 17:21
  • I haven't (seriously) implemented any other mechanism, so I don't have any recommendation. If you can live with a maximum distance of `3`, then do use it; otherwise, you'll have to explore on your own, I am afraid :) – Matthieu M. Apr 18 '11 at 18:46
  • @MatthieuM. what about nesting automata to allow, for example, an edit distance of up to 3+3? – 0x90 Dec 11 '20 at 15:21