0

I'm using the Levenshtein distance algorithm to filter through a list of strings and determine the best match for text-field auto-completion (along with the top 5 best results).

Currently, I have an array of strings and apply the algorithm to each one to determine how close a match it is to the text typed by the user. The problem is that I'm not sure how to interpret the values the algorithm outputs in order to rank the results as expected.

For example: (Text typed = "nvmb")

  1. Result: "game" ; levenshtein distance = 3 (best match)
  2. Result: "number the stars" ; levenshtein distance = 13 (second best match)

This technically makes sense; the second result needs many more 'edits' because of its length. The problem is that the second result is logically and visually a much closer match than the first one. It's almost as if I should ignore any characters beyond the length of the typed text.

Any ideas on how I could achieve this?

Ryan Dias
  • Levenshtein distance probably isn't what you want here. I would suggest something similar to the accepted answer to this question: http://stackoverflow.com/questions/2815083/efficient-data-structure-for-word-lookup-with-wildcards – Jim Mischel Jun 01 '15 at 15:37

2 Answers

1

Levenshtein distance by itself is good for correcting a query, not for auto-completion.

I can propose an alternative solution:

First, store your strings in a prefix tree (trie) instead of an array, so you don't need to examine all of them.

Second, given the user's input, enumerate trie prefixes at a fixed distance from it, and suggest completions for each.

Your example: Text typed = "nvmb"

  1. At distance 0, there are no completions
  2. Enumerate strings at distance 1
  3. Only "numb" will have some completions

Another example: Text typed = "gamb"

  1. For distance 0 you have only one completion, "gambling"; make it the first suggestion, and continue to get 4 more
  2. For distance 1 you will get "game" and some completions for it

Of course, this approach sometimes gives more than 5 results, but you can order them by another criterion that doesn't depend on the current query.

I think this is more efficient because you can typically cap the distance at two, i.e. check on the order of 1000*n prefixes, where n is the length of the input, which is usually far fewer than the number of stored strings.
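A minimal sketch of this approach in Python (all names here are illustrative, not from the answer). It walks the trie while maintaining one Levenshtein DP row per node, collects completions under any prefix that comes within the distance budget, and ranks words by the smallest prefix distance found:

```python
# Sketch of trie-based fuzzy auto-completion (illustrative names).

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def _completions(node, prefix, out):
    # Collect every stored word at or below `node`.
    if node.is_word:
        out.append(prefix)
    for ch, child in node.children.items():
        _completions(child, prefix + ch, out)

def fuzzy_autocomplete(root, query, max_dist=2, limit=5):
    """Suggest words that have some prefix within `max_dist` edits of `query`."""
    best = {}  # word -> smallest prefix distance found so far

    def walk(node, prefix, row):
        # row[i] = edit distance between `prefix` and query[:i]
        if row[-1] <= max_dist:
            # This trie prefix matches the whole query closely enough:
            # every word below it is a completion candidate.
            words = []
            _completions(node, prefix, words)
            for w in words:
                if row[-1] < best.get(w, max_dist + 1):
                    best[w] = row[-1]
        if min(row) > max_dist:
            return  # the minimum never decreases, so prune this branch
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for i, qc in enumerate(query, 1):
                cost = 0 if qc == ch else 1
                new_row.append(min(new_row[i - 1] + 1,  # skip trie char
                                   row[i] + 1,          # skip query char
                                   row[i - 1] + cost))  # match/substitute
            walk(child, prefix + ch, new_row)

    walk(root, "", list(range(len(query) + 1)))
    ranked = sorted(best.items(), key=lambda kv: (kv[1], kv[0]))
    return [w for w, _ in ranked[:limit]]
```

On the question's example, with a toy word list, the query "nvmb" surfaces "numb" and "number the stars" at prefix distance 1, while "game" stays at distance 3 and is filtered out:

```python
root = build_trie(["game", "gambling", "numb", "number the stars"])
fuzzy_autocomplete(root, "nvmb")  # ['numb', 'number the stars', 'gambling']
```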

0

The Levenshtein distance corresponds to the number of single-character insertions, deletions and substitutions in an optimal global pairwise alignment of two sequences if the gap and mismatch costs are all 1.

The Needleman-Wunsch DP algorithm will find such an alignment, in addition to its score (it's essentially the same DP algorithm as the one used to calculate the Levenshtein distance, but with the option to weight gaps, and mismatches between any given pair of characters, arbitrarily).

But there are more general models of alignment that allow reduced penalties for gaps at the start or the end (and reduced penalties for contiguous blocks of gaps, which may also be useful here, although it doesn't directly answer the question). At one extreme, you have local alignment, where you pay no penalty at all for gaps at the ends; this is computed by the Smith-Waterman DP algorithm.

I think what you want here is in between: you want to penalise gaps at the start of both the query and test strings, and gaps in the test string at the end, but not gaps in the query string at the end. That way, trailing mismatches cost nothing, and the costs will look like:

Query:    nvmb
Costs:    0100000000000000      =  1 in total
Against:  number the stars

Query:    nvmb
Costs:    1101                  =  3 in total
Against:  game

Query:    number the stars
Costs:    0100111111111111      = 13 in total
Against:  nvmb

Query:       ber     star
Costs:    1110001111100000      =  8 in total
Against:  number the stars

Query:    some numbor
Costs:    111110000100000000000 =  6 in total
Against:       number the stars

(In fact you might want to give trailing mismatches a small nonzero penalty, so that an exact match is always preferred to a prefix-only match.)

The Algorithm

Suppose the query string A has length n, and the string B that you are testing against has length m. Let d[i][j] be the DP table value at (i, j) -- that is, the cost of an optimal alignment of the length-i prefix of A with the length-j prefix of B. If you go with a zero penalty for trailing mismatches, you only need to modify the NW algorithm in a very simple way: instead of calculating and returning the DP table value d[n][m], you just need to calculate the table as before, and find the minimum of any d[n][j], for 0 <= j <= m. This corresponds to the best match of the query string against any prefix of the test string.
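A minimal sketch of that modification in Python, with unit gap/mismatch costs. The function name and the `trailing_penalty` parameter (the optional small nonzero penalty suggested above) are illustrative:

```python
def prefix_match_cost(query, test, trailing_penalty=0.0):
    """Cost of aligning `query` against the best-matching prefix of `test`,
    i.e. min over j of d[n][j], charging `trailing_penalty` per leftover
    trailing character of `test` (illustrative parameter)."""
    n, m = len(query), len(test)
    # prev[j] = d[i][j]: cost of aligning query[:i] with test[:j].
    prev = list(range(m + 1))  # d[0][j] = j: leading gaps are penalised
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if query[i - 1] == test[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # gap in test
                         cur[j - 1] + 1,     # gap in query
                         prev[j - 1] + cost) # match/substitution
        prev = cur
    # Trailing gaps in the query are free (or cheap): take the best match
    # of the full query against any prefix of `test`.
    return min(prev[j] + trailing_penalty * (m - j) for j in range(m + 1))
```

This reproduces the costs above: `prefix_match_cost("nvmb", "number the stars")` is 1, `prefix_match_cost("nvmb", "game")` is 3, and `prefix_match_cost("number the stars", "nvmb")` is 13; with a small `trailing_penalty`, the exact match "numb" scores strictly better against "numb" than against "number".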

j_random_hacker