Questions tagged [edit-distance]

A string metric quantifying how different two strings are. More specifically, it is the minimum number of operations needed to transform one string into the other. Depending on the variant, operations include the insertion, deletion, substitution, or transposition of a character. Variants may allow different combinations of operations and may assign them different costs.
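As an illustration of the definition above, here is a minimal Python sketch of the Levenshtein variant, assuming unit cost for insertion, deletion, and substitution and no transpositions. The function name `levenshtein` is chosen here for illustration and does not come from the tag description.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions, and
    substitutions needed to turn a into b (no transpositions)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as b
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the dynamic programming table are kept, so memory is proportional to the shorter string while time stays proportional to the product of the two lengths.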

References

Edit distance (Wikipedia)

256 questions
15
votes
8 answers

Efficient way of calculating likeness scores of strings when sample size is large?

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list. I'm aware of how…
matt b
  • 138,234
  • 66
  • 282
  • 345
14
votes
1 answer

How to normalise Levenshtein distance for maximum alignment length rather than for string length?

Problem: A few R packages feature Levenshtein distance implementations for computing the similarity of two strings, e.g. http://finzi.psych.upenn.edu/R/library/RecordLinkage/html/strcmp.html. The distances computed can easily be normalised for…
jvh_ch
  • 337
  • 2
  • 11
13
votes
3 answers

Fast(er) algorithm for the Length of the Longest Common Subsequence (LCS)

Problem: Need the Length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, 4 characters "ACGT". The dynamic programming approach is not quick enough. My problem is that I am dealing with lots…
Yiannis
  • 131
  • 1
  • 6
12
votes
6 answers

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful. The Wikipedia article on edit distance gives some good…
12
votes
20 answers

Given two strings, find if they are one edit away from each other

I came across this question recently: Given two strings, return true if they are one edit away from each other, else return false. An edit is the insertion, replacement, or deletion of a character. Ex. {"abc","ab"}->true, {"abc","adc"}->true, {"abc","cab"}->false. One…
codewarrior
  • 984
  • 4
  • 18
  • 34
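The check described in the question above can be done in a single pass without building a full edit-distance table. The following Python sketch is one possible approach, not taken from the question or its answers. It assumes that identical strings count as zero edits and therefore return False; the function name `one_edit_away` is hypothetical.

```python
def one_edit_away(s: str, t: str) -> bool:
    """True if exactly one insertion, deletion, or replacement
    turns s into t. Identical strings give False (an assumption)."""
    if abs(len(s) - len(t)) > 1:
        return False
    if len(s) > len(t):
        s, t = t, s  # make s the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(s) and s[i] == t[i]:
        i += 1
    if i == len(s):
        # s is a prefix of t: one edit only if t has exactly one extra character.
        return len(t) == len(s) + 1
    if len(s) == len(t):
        # Same length: the mismatch must be the single replacement.
        return s[i + 1:] == t[i + 1:]
    # t is one longer: dropping t[i] must make the remainders equal.
    return s[i:] == t[i + 1:]


print(one_edit_away("abc", "ab"))   # True
print(one_edit_away("abc", "adc"))  # True
print(one_edit_away("abc", "cab"))  # False
```

The three calls reproduce the examples given in the question. The running time is linear in the length of the strings.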
10
votes
1 answer

Quickly check large database for edit-distance similarity

I have a database of 350,000 strings with an average length of about 500. The strings are not made up of words, they are an essentially random assortment of characters. I need to make sure no two of the strings are too similar, where similarity is…
Evan Weissburg
  • 1,564
  • 2
  • 17
  • 38
9
votes
5 answers

How do I calculate the "difference" between two sequences of points?

I have two sequences of length n and m. Each is a sequence of points of the form (x,y) and represents a curve in an image. I need to find how different (or similar) these sequences are given the fact that one sequence is likely longer than the other…
WanderingPhd
  • 189
  • 1
  • 9
9
votes
1 answer

Is there any way to calculate % match between 2 strings?

Is there any way to calculate % match between 2 strings? I have a situation where I need to calculate matches between 2 strings; if there is an 85% match, then I will combine 2 tables. I have written the code for combining 2 tables my sample…
Dilip G
  • 489
  • 2
  • 7
  • 15
9
votes
1 answer

Change distance between x-axis ticks in ggplot2

Right now I am producing a line graph with three observations. Hence, there are three x-axis ticks. I want to manually reduce the distance between the x-axis ticks and basically force the observations to be closer to each other. In other words, I…
user1738753
  • 626
  • 4
  • 12
  • 19
8
votes
1 answer

String distance, transpositions only

Possible Duplicate: Counting the swaps required to convert one permutation into another. I'm looking for an algorithm that would count some kind of string distance where the only allowed operation is the transposition of two adjacent characters. For…
8
votes
1 answer

Levenshtein Distance Formula in CoffeeScript?

I am trying to create or find a CoffeeScript implementation of the Levenshtein Distance formula, aka Edit Distance. Here is what I have so far, any help at all would be much appreciated. levenshtein = (s1,s2) -> n = s1.length m = s2.length …
8
votes
2 answers

What's the difference between Levenshtein distance and the Wagner-Fischer algorithm

The Levenshtein distance is a string metric for measuring the difference between two sequences. The Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters. Both use a matrix,…
8
votes
2 answers

How can I compare two strings to find the number of characters that match in R, using substitution distance?

In R, I have two character vectors, a and b. a <- c("abcdefg", "hijklmnop", "qrstuvwxyz") b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz") I want a function that counts the character mismatches between each element of a and the corresponding element…
Ryan C. Thompson
  • 40,856
  • 28
  • 97
  • 159
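Counting mismatches between equal-length strings with substitutions only is the Hamming distance. The question above asks for R; the sketch below uses Python purely to illustrate the idea, with the two vectors copied from the excerpt. The function name `mismatches` is hypothetical, and raising an error on unequal lengths is an assumption, since the excerpt is cut off before it says how that case should be handled.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ
    (substitution-only distance, also known as Hamming distance)."""
    if len(a) != len(b):
        # Assumption: unequal lengths are treated as an error.
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


a = ["abcdefg", "hijklmnop", "qrstuvwxyz"]
b = ["abXdeXg", "hiXklXnoX", "Xrstuvwxyz"]

# Compare each element of a with the corresponding element of b.
counts = [mismatches(x, y) for x, y in zip(a, b)]
print(counts)  # [2, 3, 1]
```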
8
votes
1 answer

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database. I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is…
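For readers unfamiliar with the structure named in the question above, here is a minimal, unbalanced Burkhard-Keller tree sketch in Python. It is an illustration under stated assumptions, not the asker's code: the class name `BKTree`, its methods, and the sample names are hypothetical, and plain Levenshtein distance is assumed as the metric. It does not address balancing.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (word, children); children maps a distance d
    to the subtree of words at distance d from the node's word."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child  # descend along the edge labelled d

    def search(self, query: str, max_dist: int) -> list:
        """All (distance, word) pairs within max_dist of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only edges labelled within
            # [d - max_dist, d + max_dist] can lead to matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for name in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(name)
print(tree.search("bok", 1))  # [(1, 'boo'), (1, 'book')]
```

The pruning step relies on the distance being a true metric, which holds for Levenshtein distance. The shape of the tree depends on insertion order, which is why the question of balancing comes up.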
7
votes
2 answers

Abbreviation similarity between strings

I have a use case in my project where I need to compare a key-string with a large number of strings for similarity. If this value is greater than a certain threshold, I consider those strings "similar" to my key and based on that list, I do some further…
vish4071
  • 5,135
  • 4
  • 35
  • 65