Questions tagged [levenshtein-distance]

A metric for measuring the amount of difference between two sequences. The Levenshtein distance allows deletion, insertion and substitution.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance is one specific member of the family of edit distance metrics.
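
As an illustration of the definition above, here is a minimal sketch of the classic dynamic-programming approach in Python. The function name `levenshtein` and the two-row layout are choices made for this example, not part of any particular library; it assumes unit cost for each insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b. Initially the prefix of a is empty.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use is proportional to the shorter string, while the running time remains proportional to the product of the two lengths.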

References:
Wikipedia
RosettaCode
Edit Distance (Wikipedia)
Hirschberg's algorithm (Wikipedia)

967 questions
26
votes
3 answers

Normalizing the edit distance

Can we normalize the Levenshtein edit distance by dividing the edit distance value by the length of the two strings? I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two…
25
votes
11 answers

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database. For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"…
Ash
24
votes
4 answers

Java fuzzy String matching with names

I've got a stand-alone CSV data loading process that I coded in Java that has to use some fuzzy string matching. It's definitely not ideal, but I don't have much choice. I am matching using a first and last name and I cache all the possibilities at…
Durandal
24
votes
4 answers

Where can the documentation for python-Levenshtein be found online?

I've found a great python library implementing Levenshtein functions (distance, ratio, etc.) at http://code.google.com/p/pylevenshtein/ but the project seems inactive and the documentation is nowhere to be found. I was wondering if anyone knows…
Phil B
23
votes
4 answers

What is a good metric for deciding if 2 Strings are "similar enough"?

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings. What I'm doing currently is basically taking the total number of edits…
Hristo
23
votes
2 answers

Edit distance such as Levenshtein taking into account proximity on keyboard

Is there an edit distance such as Levenshtein which takes into account distance for substitutions? For example, when judging whether words are equal, typo and tylo are really close (p and l are physically close on the keyboard), while typo and…
PascalVKooten
22
votes
2 answers

Python: String clustering with scikit-learn's DBSCAN, using Levenshtein distance as metric

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find the original and the typos of each URL. I decided to use the Levenshtein distance as a similarity metric, along with DBSCAN as the clustering algorithm as…
22
votes
4 answers

Edit distance between two graphs

Just as we have the Levenshtein distance (or edit distance) between two strings, I'm wondering whether there is something similar for graphs. I mean, a scalar measure that identifies the number of atomic operations (node and edges…
linello
20
votes
1 answer

How do you implement Levenshtein distance in Delphi?

I'm posting this in the spirit of answering your own questions. The question I had was: How can I implement the Levenshtein algorithm for calculating edit-distance between two strings, as described here, in Delphi? Just a note on performance: This…
JosephStyons
20
votes
6 answers

Alternative to Levenshtein and Trigram

Say I have the following two strings in my database: (1) 'Levi Watkins Learning Center - Alabama State University' (2) 'ETH Library' My software receives free text inputs from a data source, and it should match those free texts to the pre-defined…
Jonas Sourlier
20
votes
7 answers

How to install python-levenshtein on Windows?

After searching for days I'm about ready to give up finding precompiled binaries for Python 2.7 (Windows 64-bit) of the Python Levenshtein library, so now I'm attempting to compile it myself. I've installed the most recent version of MinGW32…
Hubro
18
votes
5 answers

How to sort an array by similarity in relation to an inputted word

I have a PHP array, for example: $arr = array("hello", "try", "hel", "hey hello"); Now I want to rearrange the array so that the words closest to my $search var come first. How can I do that?
AimOn
18
votes
10 answers

Can't install Levenshtein distance package on Windows Python 3.5

I need to install the Python Levenshtein distance package in order to use this library. Unfortunately, I am not able to install it successfully. I usually install libraries with pip. However, this time I am getting an error: [WinError 2] The system cannot…
hipoglucido
18
votes
9 answers

Best way to detect similar email addresses?

I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as username1@gmail.com, username1a@gmail.com, username1b@gmail.com, etc. I want to find similar email addresses for…
Chris
17
votes
1 answer

Is there a sparse edit distance algorithm?

Say you have two strings of length 100,000 containing zeros and ones. You can compute their edit distance in roughly 10^10 operations. If each string only has 100 ones and the rest are zeros then I can represent each string using 100 integers…
Simd