0

Given an arbitrary string s, I would like a method to quickly retrieve the subset S ⊆ M of a large set of strings M (where |M| > 1 million), such that every string in S has an edit distance < t from s, for some maximum threshold t.

At worst, S may be empty if no string in M meets this criterion, and at best, S = {s} (an exact match). For any case in between, I fully expect that S may be quite large.

In general, I expect the maximum edit distance threshold to be fixed (e.g., 2), and I need to perform this operation many times over arbitrary strings s. Hence the need for an efficient method: naively iterating over and testing all strings would be too expensive.

While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.

Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?

UPDATE #1

I have since learned that metric trees are precisely the kind of structure I am after: they exploit the distance metric to organise subsets of the strings in M by their distance from each other. Vantage-point trees, BK-trees, and similar metric tree data structures and algorithms all seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...

UPDATE #2

Using a combination of this bk-tree and this Levenshtein distance implementation, I can now retrieve matching subsets for arbitrary strings from a set (M) of one million strings, with retrieval times of around 10 ms.
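For readers who want to see the idea rather than a particular library, here is a minimal sketch of a BK-tree over Levenshtein distance. It is not the implementation linked above; the class name `BkTree` and its methods are made up for illustration. Note that `search` treats `maxDistance` as inclusive, so for the strict bound "edit distance < t" in the question you would pass t - 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative BK-tree keyed on Levenshtein distance (hypothetical names). */
class BkTree {
    private static final class Node {
        final String word;
        // Children are indexed by their distance to this node's word.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    /** Two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return; // word is already stored
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    /** Returns all stored words within maxDistance (inclusive) of the query. */
    List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        if (root == null) return result;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = levenshtein(query, node.word);
            if (d <= maxDistance) result.add(node.word);
            // Triangle inequality: only subtrees whose edge label lies in
            // [d - maxDistance, d + maxDistance] can contain matches.
            for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
                int edge = e.getKey();
                if (edge >= d - maxDistance && edge <= d + maxDistance) stack.push(e.getValue());
            }
        }
        return result;
    }
}
```

Building the tree costs one distance computation per level for each inserted word. The saving at query time comes from the edge-label check, which lets whole subtrees be skipped without computing any distance inside them.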

2 Answers

2

BK-trees are designed for exactly this case. They work with a distance metric, such as Levenshtein distance or a distance derived from the Jaccard index.
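One caveat on the Jaccard side: the Jaccard index is a similarity, not a distance. The usual way to obtain a metric is the Jaccard distance, 1 minus the index, which does satisfy the triangle inequality. Below is a small sketch, assuming strings are compared as sets of character bigrams (that tokenisation is my assumption, not something from the question; the class name `JaccardSketch` is made up).

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative Jaccard index/distance over character bigrams (hypothetical names). */
class JaccardSketch {
    /** Assumption: a string is represented by its set of 2-character substrings. */
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    /** |A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty. */
    static double index(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        if (x.isEmpty() && y.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        int unionSize = x.size() + y.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Jaccard distance: 1 minus the Jaccard index. */
    static double distance(String a, String b) {
        return 1.0 - index(a, b);
    }
}
```

Because this distance is real-valued, it does not map directly onto a BK-tree, which indexes children by integer distance. A VP-tree, or some discretisation of the distance, is probably the more natural fit there.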

Frédéric Dumont
  • The [BK-tree](https://en.wikipedia.org/wiki/BK-tree) seems appropriate here, thanks for your suggestion. It led me to learn about metric trees in general, including the [VP-tree](https://en.wikipedia.org/wiki/Vantage-point_tree) which also looks appropriate. –  Feb 05 '15 at 22:28
  • Thanks, I've successfully used the bk-tree implementation you recommended! –  Feb 06 '15 at 01:07
0

Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:

Damn Cool Algorithms: Levenshtein Automata

As already mentioned by H W, you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up the distance calculation. Combine this with an efficient data structure for your dictionary (e.g. a trie, as mentioned in the Wikipedia article), and you might be able to accelerate your current approach.
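To make the trie idea concrete, here is a rough sketch. It is not a Levenshtein automaton; it is the simpler, related technique of walking a trie while carrying one row of the edit-distance table per node, so that words sharing a prefix share the work. The class name `TrieFuzzySearch` and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative trie with bounded Levenshtein search (hypothetical names). */
class TrieFuzzySearch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        String word; // non-null only where a dictionary word ends
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.word = word;
    }

    /** Returns all stored words within maxDistance (inclusive) of the query. */
    List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        // Row for the empty prefix: distance to query[0..j) is j insertions.
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) firstRow[j] = j;
        if (root.word != null && query.length() <= maxDistance) result.add(root.word);
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, maxDistance, result);
        }
        return result;
    }

    private void walk(Node node, char c, String query, int[] prevRow,
                      int maxDistance, List<String> result) {
        int n = query.length();
        int[] row = new int[n + 1];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j <= n; j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        // row[n] is the distance between the query and the prefix ending at this node.
        if (node.word != null && row[n] <= maxDistance) result.add(node.word);
        // If every entry already exceeds the bound, no extension of this prefix can match.
        if (rowMin <= maxDistance) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), query, row, maxDistance, result);
            }
        }
    }
}
```

In this sketch the row-minimum check prunes entire subtrees, so with a small threshold many dictionary words are never visited individually. How much that helps in practice depends on the dictionary and the threshold.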

qqilihq
  • BK trees, described on the same blog, are a solution to this problem. There are Java implementations, for instance this [one](https://github.com/gtri/bk-tree) (not tried). – Frédéric Dumont Feb 05 '15 at 09:08