Is the PHP levenshtein() function buggy?

Question

On the PHP manual page for levenshtein(), I am using example #1 with the following variables:

// input misspelled word
$input = 'htc corporation';

// array of words to check against
$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
        'radish', 'carrot', 'pea', 'bean');

Could someone please tell me why the expected result is carrot rather than htc? Thanks

I figure that because your array already contains the exact word `htc`, it is looking for something similar to your input string, so it skips `htc` because it didn't find anything else close to it; therefore `corporation` is closer to `carrot`. At least, that's my take on it; I never had to deal with this, let alone knew about the function, until today. — Funk Forty Niner, Aug 02 '13 at 15:52
This is why you can't use Levenshtein as such to implement a fuzzy search. — JJJ, Aug 02 '13 at 15:54
@Fred The algorithm doesn't "skip" anything. The distance is just greater with "htc" than with "carrot". — JJJ, Aug 02 '13 at 15:55
Have you had a look at [soundex()](http://php.net/manual/en/function.soundex.php) , [metaphone()](http://php.net/manual/en/function.metaphone.php) and http://stackoverflow.com/a/5430851/2493918? Also have a look at [Levenshtein distance](http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#PHP) — Markus Hofmann, Aug 02 '13 at 15:56
@Juhana Thanks for the clarification. Have a look at my edited comment, it will explain everything about "my take on it", cheers :) — Funk Forty Niner, Aug 02 '13 at 15:56
You could make the result be `htc` by making insertions/deletions cheaper than replacements: `levenshtein('htc corporation', 'carrot', 1, 2, 1); // 13` If you are searching for a substring you may even want to make deletions free. — Paul, Aug 02 '13 at 16:06
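
To illustrate the cost-weighting suggestion in the last comment, here is a minimal sketch. It assumes the argument order given in the PHP manual (insertion cost, replacement cost, deletion cost). With replacements costing 2, `htc` comes out closest for this particular word list; other inputs may behave differently.

```php
<?php
$input = 'htc corporation';
$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
        'radish', 'carrot', 'pea', 'bean');

// Costs: insertion = 1, replacement = 2, deletion = 1
echo levenshtein($input, 'htc', 1, 2, 1), "\n";    // 12 (twelve deletions)
echo levenshtein($input, 'carrot', 1, 2, 1), "\n"; // 13

// Pick the word with the smallest weighted distance
$closest = null;
$shortest = PHP_INT_MAX;
foreach ($words as $word) {
    $distance = levenshtein($input, $word, 1, 2, 1);
    if ($distance < $shortest) {
        $shortest = $distance;
        $closest = $word;
    }
}
echo $closest, "\n"; // htc
```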

score 4 · Answer 1 · edited Aug 02 '13 at 16:00

Because the Levenshtein distance from htc corporation to htc is 12, whereas the distance to carrot is only 11.

The levenshtein function calculates how many characters it has to add, remove, or replace to get to a certain word. Because htc corporation has 12 more characters than htc, it has to remove 12 to get to just htc. To get to the word carrot from htc corporation takes only 11 changes.

edited Aug 02 '13 at 16:00

JJJ

answered Aug 02 '13 at 15:52

Novocaine


I got 11 too, but there could be a better way. – Paul Aug 02 '13 at 15:57
corporation has 11 letters, plus a space = 12 – Novocaine Aug 02 '13 at 15:57
@Novocaine88 We're talking about the distance between "carrot" and "htc corporation". Remove 9 characters to get "corpot" and change 2 characters to get to "carrot" – JJJ Aug 02 '13 at 15:58
ah ok, I wasn't sure about the calc to get to carrot. – Novocaine Aug 02 '13 at 15:59
I removed `9` characters to get `c crot` and then replaced the space and `c` to get `carrot` :) – Paul Aug 02 '13 at 15:59
It's definitely not 6 though, as just to get the strings to be the same length you need at least `9` operations. – Paul Aug 02 '13 at 16:00

hobbs · Answer 2 · 2013-08-02T16:29:52.437

"htc corporation" to "htc" has a distance of 12 (remove " corporation" = 12 characters). "htc corporation" to "carrot" has a distance of no more than 11.

"htc corporation" => "corporation": 4
"corporation" => "corporat": 3
"corporat" => "corrat": 2
"corrat" => "carrat": 1
"carrat" => "carrot": 1

4 + 3 + 2 + 1 + 1 = 11
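
The step-by-step path above can be checked with a short script. This is only a sketch: it measures each step with `levenshtein()` and compares the sum with the direct distance. The `$steps` array is just the list of intermediate strings from the breakdown above.

```php
<?php
// Intermediate strings from the breakdown above
$steps = array('htc corporation', 'corporation', 'corporat',
        'corrat', 'carrat', 'carrot');

$total = 0;
for ($i = 0; $i < count($steps) - 1; $i++) {
    // Distance of one step in the path
    $d = levenshtein($steps[$i], $steps[$i + 1]);
    printf("%s => %s: %d\n", $steps[$i], $steps[$i + 1], $d);
    $total += $d;
}

// The sum of the steps is an upper bound on the direct distance
printf("sum of steps: %d\n", $total);                                      // 11
printf("direct distance: %d\n", levenshtein('htc corporation', 'carrot')); // 11
```

Here the path happens to be optimal, so both numbers agree.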

It looks like what you might be looking for isn't straight-up levenshtein distance, but a "closest substring" match. There's an example implementation of such a thing using a modified Levenshtein algorithm here. Using this algorithm gives scores of:

htc: 0
Sprint Nextel: 11
Sprint: 4
banana: 5
orange: 3
radish: 3
carrot: 3
pea: 2
bean: 3

which recognizes "htc" as an exact substring match and gives it a score of zero. The runner-up, "pea", has a score of two, because you could align it with the "p", the "e", or the "a" in corporation, and then replace the other two characters, etc. When working with this algorithm you should be aware that the score will never be higher than the length of the "needle" string, so shorter strings will generally get lower scores (they're "easier to match").
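
For reference, a "closest substring" score of this kind can be sketched as follows. This is not the linked implementation, just one common way to modify the Levenshtein table: the first row is all zeros, so a match may start anywhere in the haystack, and the result is the minimum of the last row, so it may end anywhere. The function name `fuzzy_substring_distance` is made up for this example.

```php
<?php
// Hypothetical helper (not a PHP built-in): smallest edit distance between
// $needle and any substring of $haystack.
function fuzzy_substring_distance(string $needle, string $haystack): int
{
    $n = strlen($needle);
    $m = strlen($haystack);

    // Row 0 is all zeros: skipping a prefix of the haystack is free
    $prev = array_fill(0, $m + 1, 0);

    for ($i = 1; $i <= $n; $i++) {
        // Column 0: matching $i needle characters against nothing costs $i
        $curr = array($i);
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($needle[$i - 1] === $haystack[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j - 1] + $cost, // match or replace
                $prev[$j] + 1,         // needle character has no counterpart
                $curr[$j - 1] + 1      // extra haystack character inside the match
            );
        }
        $prev = $curr;
    }

    // Minimum of the last row: the match may end anywhere in the haystack
    return min($prev);
}

echo fuzzy_substring_distance('htc', 'htc corporation'), "\n"; // 0
echo fuzzy_substring_distance('pea', 'htc corporation'), "\n"; // 2
```

Because column 0 of the last row holds the needle length, the score can never exceed `strlen($needle)`, which is the bias towards short strings mentioned above.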

score 2 · Accepted Answer · answered Aug 02 '13 at 15:59

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

Here is a simple analysis:

$input = 'htc corporation';

// array of words to check against
$words = array(
    'htc',
    'Sprint Nextel',
    'Sprint',
    'banana',
    'orange',
    'radish',
    'carrot',
    'pea',
    'bean' 
);

foreach ( $words as $word ) {

    // Characters of $input that also occur in $word
    $ic = array_intersect(str_split($input), str_split($word));

    printf("%s \t l= %s , s = %s , c = %d \n",$word ,  
    levenshtein($input, $word), 
    similar_text($input, $word), 
    count($ic));
}

Output

htc      l= 12 , s = 3 , c = 5 
Sprint Nextel    l= 14 , s = 3 , c = 8 
Sprint   l= 12 , s = 1 , c = 7 
banana   l= 14 , s = 2 , c = 2 
orange   l= 12 , s = 4 , c = 7 
radish   l= 12 , s = 3 , c = 5 
carrot   l= 11 , s = 1 , c = 10  
pea      l= 13 , s = 2 , c = 2 
bean     l= 13 , s = 2 , c = 2

It is clear that htc has a distance of 12 while carrot has 11. If you want htc, then Levenshtein alone is not enough: you need to check for an exact word match first and then set priorities.
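
One possible way to set such priorities is sketched below. It is only an illustration: the function name `closest_word` and the rule "an exact whole-word match wins, otherwise fall back to the smallest Levenshtein distance" are assumptions made for this example, not something PHP provides.

```php
<?php
// Hypothetical helper: prefer a candidate that appears as a whole word in
// the input, otherwise fall back to the smallest Levenshtein distance.
function closest_word(string $input, array $words): ?string
{
    // Priority 1: exact (case-insensitive) match against any word of the input
    $tokens = preg_split('/\s+/', strtolower(trim($input)));
    foreach ($words as $word) {
        if (in_array(strtolower($word), $tokens, true)) {
            return $word;
        }
    }

    // Priority 2: plain Levenshtein distance
    $closest = null;
    $shortest = PHP_INT_MAX;
    foreach ($words as $word) {
        $distance = levenshtein($input, $word);
        if ($distance < $shortest) {
            $shortest = $distance;
            $closest = $word;
        }
    }
    return $closest;
}

$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
        'radish', 'carrot', 'pea', 'bean');

echo closest_word('htc corporation', $words), "\n"; // htc
echo closest_word('carrrot', $words), "\n";         // carrot
```

Multi-word candidates such as `Sprint Nextel` never match in the first pass here, because the input is split on whitespace; they are only considered in the fallback.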
