Questions tagged [levenshtein-distance]

A metric for measuring the amount of difference between two sequences. The Levenshtein distance allows deletion, insertion and substitution.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance is one specific type of edit distance; other edit distances allow a different set of operations.
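
The definition above can be illustrated with a minimal sketch of the standard dynamic-programming approach, which keeps only two rows of the table at a time. The function name `levenshtein` is an illustrative choice, not taken from any particular library, and every edit is assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the empty prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Distance between the first i characters of a and the empty string.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Turning "kitten" into "sitting" takes two substitutions (k to s, e to i) and one insertion (g), hence a distance of 3.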

References:
Wikipedia
RosettaCode
Edit Distance (Wikipedia)
Hirschberg's algorithm (Wikipedia)

967 questions
17 votes, 2 answers

How can I determine Levenshtein distance for Mandarin Chinese characters?

We are developing a system to do fuzzy matching on over 50 international languages using the UTF-8, UTF-16, and UTF-32 Unicode character standard. So far, we have been able to use Levenshtein distance to detect misspellings of German Unicode…
Frank
15 votes, 1 answer

Efficiently determine "how sorted" a list is, e.g. Levenshtein distance

I'm doing some research on ranking algorithms, and would like to, given a sorted list and some permutation of that list, calculate some distance between the two permutations. For the case of the Levenshtein distance, this corresponds to calculating…
15 votes, 2 answers

Algorithm to find edit distance to all substrings

Given two strings s and t, I need to find, for each substring of s, the edit distance (Levenshtein distance) to t. Actually, I need to know, for each position i in s, the minimum edit distance over all substrings starting at position i. For example: t =…
15 votes, 3 answers

String Distance Matrix in Python

How to calculate a Levenshtein distance matrix of strings in Python?

      str1  str2  str3  str4  ...  strn
str1  0.8   0.4   0.6   0.1   ...  0.2
str2  0.4   0.7   0.5   0.1   ...  0.1
…
15 votes, 3 answers

How do diff/patch work and how safe are they?

Regarding how they work, I was wondering about the low-level details: What will trigger a merge conflict? Is the context also used by the tools in order to apply the patch? How do they deal with changes that do not actually modify source code behavior?…
cenouro
14 votes, 2 answers

How to convert a string into a palindrome with the minimum number of operations?

The problem is to convert a string into a palindrome with the minimum number of operations. I know it is similar to the Levenshtein distance, but I can't solve it yet. For example, for input mohammadsajjadhossain, the output is 8.
user467871
14 votes, 6 answers

Text similarity algorithm

I have two subtitle files. I need a function that tells whether they represent the same text or similar text. Sometimes there are comments like "The wind is blowing... the music is playing" in one file only. But 80% of the contents will…
EugeneP
14 votes, 1 answer

How to normalise Levenshtein distance for maximum alignment length rather than for string length?

Problem: A few R packages feature Levenshtein distance implementations for computing the similarity of two strings, e.g. http://finzi.psych.upenn.edu/R/library/RecordLinkage/html/strcmp.html. The distances computed can easily be normalised for…
jvh_ch
14 votes, 3 answers

Levenshtein distance in regular expression

Is it possible to include Levenshtein distance in a regular expression query? (Except by making union between permutations, like this to search for "hello" with Levenshtein distance 1: .ello | h.llo | he.lo | hel.o | hell. since this is stupid and…
zdenda.online
13 votes, 5 answers

Levenshtein distance symmetric?

I was informed that Levenshtein distance is symmetric. When I used Google's diffMatchPatch tool, which computes Levenshtein distance among other things, the results did not suggest that Levenshtein distance is symmetric, i.e. Levenshtein(x1,x2) is not equal to…
user1271793
13 votes, 9 answers

Efficient string similarity grouping

Setting: I have data on people and their parents' names, and I want to find siblings (people with identical parent names). pdata<-data.frame(parents_name=c("peter pan + marta steward", "pieter pan + marta…
sheß
13 votes, 9 answers

How do I convert between a measure of similarity and a measure of difference (distance)?

Is there a general way to convert between a measure of similarity and a measure of distance? Consider a similarity measure like the number of 2-grams that two strings have in common. 2-grams('beta', 'delta') = 1 2-grams('apple', 'dappled') = 4 What…
135498
13 votes, 2 answers

Calculating a relative Levenshtein distance - make sense?

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same". Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of…
Joseph Tura
12 votes, 3 answers

Matching an approximate string in a Core Data store

I have a small problem with the Core Data application I'm currently writing. I have two different models, contexts and persistent stores. One is for my app data, the other one is for a website with information relevant to me. Most of the time, I match…
damdamdam
12 votes, 6 answers

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful. The Wikipedia article on edit distance gives some good…