
I am using Example #1 from the PHP manual page for levenshtein() with the following variables:

// input misspelled word
$input = 'htc corporation';

// array of words to check against
$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
        'radish', 'carrot', 'pea', 'bean');

Could someone please tell me why the expected result is carrot rather than htc? Thanks

lomse
  • I figure because your array already contains the exact word `htc`, that it is looking for something similar to your input string, so it's skipping `htc` because it didn't find anything else close to it, therefore `corporation` is closer to `carrot`. Least, that's my take on it and I never had to deal with this, yet alone knowing about the function until today. – Funk Forty Niner Aug 02 '13 at 15:52
  • This is why you can't use Levenshtein as such to implement a fuzzy search. – JJJ Aug 02 '13 at 15:54
  • @Fred The algorithm doesn't "skip" anything. The distance is just greater with "htc" than with "carrot". – JJJ Aug 02 '13 at 15:55
  • Have you had a look at [soundex()](http://php.net/manual/en/function.soundex.php) , [metaphone()](http://php.net/manual/en/function.metaphone.php) and http://stackoverflow.com/a/5430851/2493918? Also have a look at [Levenshtein distance](http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#PHP) – Markus Hofmann Aug 02 '13 at 15:56
  • @Juhana Thanks for the clarification. Have a look at my edited comment, it will explain everything about "my take on it", cheers :) – Funk Forty Niner Aug 02 '13 at 15:56
  • You could make the result be `htc` by making insertions/deletions cheaper than replacements: `levenshtein('htc corporation', 'carrot', 1, 2, 1); // 13` If you are searching for a substring you may even want to make deletions free. – Paul Aug 02 '13 at 16:06
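
The cost idea from the last comment can be tried directly. This is a minimal sketch using the optional cost arguments of PHP's built-in `levenshtein()` (insertion, replacement, deletion, in that order); the specific costs 1, 2, 1 are the ones suggested in the comment, not a general recommendation.

```php
<?php
$input = 'htc corporation';

// Default costs: every insertion, replacement, and deletion costs 1.
$defaultHtc    = levenshtein($input, 'htc');     // 12 deletions
$defaultCarrot = levenshtein($input, 'carrot');  // 11 mixed edits

// Replacements cost 2, insertions and deletions cost 1.
// A replacement is now no cheaper than a delete plus an insert.
$weightedHtc    = levenshtein($input, 'htc', 1, 2, 1);
$weightedCarrot = levenshtein($input, 'carrot', 1, 2, 1);

printf("default:  htc = %d, carrot = %d\n", $defaultHtc, $defaultCarrot);
printf("weighted: htc = %d, carrot = %d\n", $weightedHtc, $weightedCarrot);
```

With these weights "htc" scores lower than "carrot", so it would be picked as the closest word.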

3 Answers

Because the Levenshtein distance from htc corporation to htc is 12, whereas the distance to carrot is only 11.

The levenshtein function counts the minimum number of single-character insertions, deletions, and replacements needed to turn one string into the other. Because htc corporation has 12 more characters than htc, 12 characters have to be removed to get to just htc. Getting to carrot from htc corporation takes only 11 changes.
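
As a quick check of the numbers above, here is a small sketch that compares the length difference with the distance reported by `levenshtein()`. The variable names are only illustrative.

```php
<?php
$input = 'htc corporation';

// The distance can never be smaller than the difference in length,
// because each extra character needs at least one deletion.
$lengthGap = strlen($input) - strlen('htc');

$toHtc    = levenshtein($input, 'htc');
$toCarrot = levenshtein($input, 'carrot');

printf("length gap to htc: %d\n", $lengthGap);
printf("distance to htc: %d, distance to carrot: %d\n", $toHtc, $toCarrot);
```

For "htc" the distance equals the length gap, since deleting " corporation" is all that is needed; "carrot" is longer and shares a few letters in order, which is enough to bring it one edit below that.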

JJJ
Novocaine

"htc corporation" to "htc" has a distance of 12 (remove " corporation" = 12 characters). "htc corporation" to "carrot" has a distance of no more than 11.

"htc corporation" => "corporation": 4
"corporation" => "corporat": 3
"corporat" => "corrat": 2
"corrat" => "carrat": 1
"carrat" => "carrot": 1

4 + 3 + 2 + 1 + 1 = 11
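
The chain above can be checked step by step. The following is a sketch that measures each hop with PHP's `levenshtein()` and compares the sum with the direct distance; the intermediate strings are the ones listed above.

```php
<?php
$steps = array(
    'htc corporation',
    'corporation',
    'corporat',
    'corrat',
    'carrat',
    'carrot',
);

$total = 0;
for ($i = 0; $i < count($steps) - 1; $i++) {
    // Distance of a single hop in the chain
    $d = levenshtein($steps[$i], $steps[$i + 1]);
    printf("\"%s\" => \"%s\": %d\n", $steps[$i], $steps[$i + 1], $d);
    $total += $d;
}

// Any chain of edits is an upper bound on the true distance.
$direct = levenshtein($steps[0], $steps[count($steps) - 1]);
printf("chain total: %d, direct distance: %d\n", $total, $direct);
```

The chain only proves that the distance is at most 11; the direct call confirms that no shorter sequence of edits exists.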

It looks like what you might be looking for isn't straight-up levenshtein distance, but a "closest substring" match. There's an example implementation of such a thing using a modified Levenshtein algorithm here. Using this algorithm gives scores of:

htc: 0
Sprint Nextel: 11
Sprint: 4
banana: 5
orange: 3
radish: 3
carrot: 3
pea: 2
bean: 3

which recognizes "htc" as an exact substring match and gives it a score of zero. The runner-up, "pea", has a score of two, because you could align it with the "p" or the "a" in corporation and then change the other two characters. When working with this algorithm you should be aware that the score will never be higher than the length of the "needle" string, so shorter strings will generally get lower scores (they're "easier to match").
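
For illustration, here is a minimal sketch of one common way to compute such a "closest substring" score: the usual Levenshtein table, except that the match may start anywhere in the haystack for free and may end anywhere. The function name `substring_distance()` is hypothetical, and this is not necessarily the same code as the linked implementation, so scores for other words may differ from the list above.

```php
<?php
// Hypothetical helper: smallest edit distance between $needle and
// any substring of $haystack (case-sensitive, byte-based).
function substring_distance(string $needle, string $haystack): int
{
    $m = strlen($needle);
    $n = strlen($haystack);

    // Row 0 is all zeros: an empty needle matches at any position for free.
    $prev = array_fill(0, $n + 1, 0);

    for ($i = 1; $i <= $m; $i++) {
        // Column 0: matching $i needle characters against nothing costs $i.
        $curr = array($i);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($needle[$i - 1] === $haystack[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j - 1] + $cost, // match or replace
                $prev[$j] + 1,         // needle character with no counterpart
                $curr[$j - 1] + 1      // extra haystack character
            );
        }
        $prev = $curr;
    }

    // The match may end at any position in the haystack.
    return min($prev);
}

printf("htc: %d\n", substring_distance('htc', 'htc corporation'));
printf("pea: %d\n", substring_distance('pea', 'htc corporation'));
```

Because the whole needle can always be replaced character by character, the result never exceeds `strlen($needle)`, which is the bias toward short strings mentioned above.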

hobbs

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

Here is a simple analysis:

$input = 'htc corporation';

// array of words to check against
$words = array(
    'htc',
    'Sprint Nextel',
    'Sprint',
    'banana',
    'orange',
    'radish',
    'carrot',
    'pea',
    'bean' 
);

foreach ( $words as $word ) {

    // Characters of $input that also occur in $word (duplicates in $input are kept)
    $ic = array_intersect(str_split($input), str_split($word));

    printf("%s \t l= %s , s = %s , c = %d \n",$word ,  
    levenshtein($input, $word), 
    similar_text($input, $word), 
    count($ic));
}

Output

htc      l= 12 , s = 3 , c = 5 
Sprint Nextel    l= 14 , s = 3 , c = 8 
Sprint   l= 12 , s = 1 , c = 7 
banana   l= 14 , s = 2 , c = 2 
orange   l= 12 , s = 4 , c = 7 
radish   l= 12 , s = 3 , c = 5 
carrot   l= 11 , s = 1 , c = 10  
pea      l= 13 , s = 2 , c = 2 
bean     l= 13 , s = 2 , c = 2 

It is clear that htc has a distance of 12 while carrot has 11. If you want htc, then Levenshtein alone is not enough: you need to check for an exact word match first and then set priorities.
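
One possible way to "set priorities" is sketched below, assuming that a candidate which appears as a whole word in the input should always win, and that plain Levenshtein distance is only the fallback. The function name `closest_word()` and the case-insensitive word comparison are assumptions made for this example.

```php
<?php
// Hypothetical helper: prefer an exact word match, else the smallest distance.
function closest_word(string $input, array $words): ?string
{
    // Split the input into lowercase words.
    $tokens = preg_split('/\s+/', strtolower(trim($input)));

    // Priority 1: a candidate that equals one of the input's words.
    foreach ($words as $word) {
        if (in_array(strtolower($word), $tokens, true)) {
            return $word;
        }
    }

    // Priority 2: the candidate with the smallest Levenshtein distance.
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($words as $word) {
        $distance = levenshtein($input, $word);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $word;
        }
    }

    return $best;
}

$words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange',
        'radish', 'carrot', 'pea', 'bean');

echo closest_word('htc corporation', $words), "\n"; // exact word match
echo closest_word('carrrot', $words), "\n";         // falls back to distance
```

Multi-word candidates such as "Sprint Nextel" are never matched by the first step in this sketch; they are only considered in the distance fallback.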

Baba