Questions tagged [levenshtein-distance]

A metric for measuring the amount of difference between two sequences. The Levenshtein distance allows deletion, insertion and substitution.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance is a specific algorithm of edit distance algorithms.

References:
Wikipedia
RosettaCode
Edit Distance (Wikipedia)
Hirschberg's algorithm (Wikipedia)

967 questions
44
votes
4 answers

How python-Levenshtein.ratio is computed

According to the python-Levenshtein.ratio source: https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722 it's computed as (lensum - ldist) / lensum. This works for # pip install python-Levenshtein import…
cjauvin
  • 3,433
  • 4
  • 29
  • 38
43
votes
2 answers

Compare similarity algorithms

I want to use string similarity functions to find corrupted data in my database. I came upon several of them: Jaro, Jaro-Winkler, Levenshtein, Euclidean and Q-gram, I wanted to know what is the difference between them and in what situations…
Ali
  • 808
  • 2
  • 11
  • 20
39
votes
11 answers

Implementing a simple Trie for efficient Levenshtein Distance calculation - Java

UPDATE 3 Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve Hanov's algorithm. Thanks to all that helped! /** * Computes the minimum Levenshtein Distance between…
Hristo
  • 45,559
  • 65
  • 163
  • 230
37
votes
4 answers

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that Levenshtein distance is good to be used as a…
36
votes
9 answers

Levenshtein distance: how to better handle words swapping positions?

I've had some success comparing strings using the PHP levenshtein function. However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings. For example: levenshtein("The quick brown…
thomasrutter
  • 114,488
  • 30
  • 148
  • 167
34
votes
4 answers

Fast Levenshtein distance in R?

Is there a package that contains Levenshtein distance counting function which is implemented as a C or Fortran code? I have many strings to compare and stringMatch from MiscPsycho is too slow for this.
mbq
  • 18,510
  • 6
  • 49
  • 72
34
votes
3 answers

Best machine learning technique for matching product strings

Here's a puzzle... I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the product names are not always identical. I've tried using the Levenshtein distance for…
33
votes
4 answers

How to add levenshtein function in mysql?

I got the code for Levenshtein distance for MySQL from http://kristiannissen.wordpress.com/2010/07/08/mysql-levenshtein/(archive.org link), but how to add that function in MySQL? I am using XAMPP and I need it for search in PHP.
Sandesh Sharma
  • 1,064
  • 5
  • 16
  • 32
32
votes
12 answers

How can I optimize this Python code to generate all words with word-distance 1?

Profiling shows this is the slowest segment of my code for a little word game I wrote: def distance(word1, word2): difference = 0 for i in range(len(word1)): if word1[i] != word2[i]: difference += 1 return…
Davy8
  • 30,868
  • 25
  • 115
  • 173
32
votes
5 answers

Improving search result using Levenshtein distance in Java

I have the following working Java code for searching for a word against a list of words and it works perfectly and as expected: public class Levenshtein { private int[][] wordMartix; public Set similarExists(String searchWord) { …
Maytham Fahmi
  • 31,138
  • 14
  • 118
  • 137
31
votes
8 answers

Levenshtein: MySQL + PHP

$word = strtolower($_GET['term']); $lev = 0; $q = mysql_query("SELECT `term` FROM `words`"); while($r = mysql_fetch_assoc($q)) { $r['term'] = strtolower($r['term']); $lev = levenshtein($word, $r['term']); if($lev >= 0 && $lev <…
user317005
31
votes
6 answers

Most efficient way to calculate Levenshtein distance

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the…
efficiencyIsBliss
  • 3,043
  • 7
  • 38
  • 44
29
votes
6 answers

Similarity Score - Levenshtein

I implemented the Levenshtein algorithm in Java and am now getting the corrections made by the algorithm, a.k.a. the cost. This does help a little but not much since I want the results as a percentage. So I want to know how to calculate those…
N00programmer
  • 1,111
  • 4
  • 13
  • 17
29
votes
8 answers

Percentage rank of matches using Levenshtein Distance matching

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as number of operations required to convert the search string into the matched…
user1368587
  • 321
  • 1
  • 3
  • 5
27
votes
7 answers

String similarity -> Levenshtein distance

I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to be effective. The problem is that the algorithm doesn't find the following examples as…
Fede Lerner
  • 457
  • 1
  • 6
  • 14
1
2
3
64 65