
Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.

So let's say my query is elepant, then the result would most likely be elephant.

If my word is fentist, the result will probably be dentist.

Of course assuming both elephant and dentist are present in my initial word list.

What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).

What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
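As a concrete baseline, the naive scan could look like this (a minimal sketch; the Levenshtein implementation and the word list are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def nearest_word(query: str, words) -> str:
    """The naive O(n) scan: compare the query against every word in the list."""
    return min(words, key=lambda w: levenshtein(query, w))

# nearest_word("elepant", ["elephant", "dentist", "banana"])  -> "elephant"
```

This is the O(n) baseline any index has to beat.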

Chris Vilches
  • As I am no expert in this topic I just post a comment; as far as my knowledge goes, what you might need is something like calculating the [string distance](https://en.wikipedia.org/wiki/String_metric) from the entered word to all other words. One way of achieving this is to calculate the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). Example implementations already exist [here](https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance). I have used it once, but at that time it was supported in the database directly - that was nice ;-) – Markus Safar Dec 27 '18 at 19:46
  • Ok - just saw your update which was not there when I started writing my comment ;-) – Markus Safar Dec 27 '18 at 19:47
  • @MarkusSafar Well, this function could prove useful when designing the index. Maybe some of it can be used as inspiration for the index. Thanks :) – Chris Vilches Dec 27 '18 at 19:48
  • Have you thought about computing the distance to a given constant? Like each word on your list has a known distance to a constant word like `stackoverflow`; then you can compute the given word's distance to `stackoverflow` and reduce your search to a smaller pool of words. Example: `elepant` to `stackoverflow` is a Levenshtein distance of 12. `elephant` is also a distance of 12. – Nifim Dec 27 '18 at 19:53
  • I am not aware of any _language_ (in the proper sense) which would have millions upon millions of words. I suspect that the strings you are trying to compare are coming from a different domain, with its own restrictions on the alphabet and string lengths. It could be very helpful to see those restrictions (if they indeed exist). – user58697 Dec 27 '18 at 22:05
  • It depends a lot on your measure of similarity. – Jim Mischel Dec 28 '18 at 04:41
  • @JimMischel Well, in this case let's assume the Levenshtein distance is good enough, for simplicity. – Chris Vilches Dec 28 '18 at 16:51
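Nifim's pivot trick from the comments can be sketched as follows (an illustrative sketch: `PIVOT` is an arbitrary choice, and the pruning rests on the triangle inequality, so it only works for metric distances such as Levenshtein):

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

PIVOT = "stackoverflow"  # any fixed reference word works

def build_pivot_index(words):
    """Bucket every word by its precomputed distance to the pivot."""
    index = defaultdict(list)
    for w in words:
        index[levenshtein(w, PIVOT)].append(w)
    return index

def candidates(query, index, radius):
    """If d(query, w) <= radius then, by the triangle inequality,
    |d(w, PIVOT) - d(query, PIVOT)| <= radius: only nearby buckets can match."""
    dq = levenshtein(query, PIVOT)
    for d in range(max(0, dq - radius), dq + radius + 1):
        yield from index.get(d, [])
```

The surviving candidates are then checked exactly; a single pivot only narrows the pool, so in practice several pivots (or a tree of them) are used.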

2 Answers


The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.

If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
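For instance, a BK-tree can be sketched in a few lines (a toy illustration, not a production index; `levenshtein` is a plain DP helper standing in for any metric distance):

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """BK-tree: each child is keyed by its distance to the parent word."""

    def __init__(self, dist, words=()):
        self.dist = dist
        self.root = None  # (word, {distance: child_node})
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, radius):
        """Return (distance, word) pairs within radius; the triangle
        inequality prunes children keyed outside [d - radius, d + radius]."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.dist(query, word)
            if d <= radius:
                results.append((d, word))
            for cd, child in children.items():
                if d - radius <= cd <= d + radius:
                    stack.append(child)
        return sorted(results)
```

Usage: `BKTree(levenshtein, words).search("elepant", 1)` returns the matches within edit distance 1, visiting only a fraction of the tree.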

If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (nmslib) provides an efficient implementation of HNSW as well as several other approximate NNS methods.

(You can compute the Levenshtein distance with Hirschberg's algorithm.)

Joshua

I made a similar algorithm some time ago.

The idea is to have an array `char[255]` indexed by character, where each value is a list of word hashes (word IDs) of the words that contain that character.

When you are searching for 'dele...': search('d') will return an empty list; search('e') will find everything with the character 'e', including elephant (twice, as it has two 'e's); search('l') will bring you a new list, and you need to combine this list with the results from the previous step.

... at the end of the input you will have a list; then you can group by word hash and order descending by count.

Another interesting thing: if your input is missing one or more characters, you will just receive an empty list in the middle of the search, and it will not affect this idea.

My initial algorithm was without ordering, and I was storing, for every character, the word ID, the line number, and the character position. My main problem was that I wanted to search with `ee` and find 'elephant', with `eleant` and find 'elephant', with `antph` and find 'elephant'. Every word was actually a line from a file, so lines were often very long, and the number of files and lines was big. I wanted quick search over directories with more than 1 GB of text files, so it was a problem even to store them in memory. For this idea you need three parts: a function to fill your cache, a function to find by character from the input, and a function to filter and maybe order the results (I didn't use ordering, as I was trying to fill my cache in the same order as I read the file, and I wanted to output lines that contain the input in that same order).
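A simplified sketch of this character index, assuming distinct characters only (the answer also counts repeated characters) and using list indices as the word IDs:

```python
from collections import Counter, defaultdict

def build_char_index(words):
    """Map each character to the set of word IDs that contain it."""
    index = defaultdict(set)
    for wid, w in enumerate(words):
        for ch in set(w):
            index[ch].add(wid)
    return index

def rank(query, words, index):
    """Count, per word ID, how many distinct query characters it contains;
    a character missing from the index simply contributes an empty set."""
    hits = Counter()
    for ch in set(query):
        for wid in index.get(ch, ()):
            hits[wid] += 1
    # "group by word hash and order descending by count"
    return [words[wid] for wid, _ in hits.most_common()]
```

Note that this ranking only measures shared characters, not edit distance, so it is a candidate filter rather than an exact nearest-neighbor search.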

I hope it makes sense.

Anatoli Klamer