Questions tagged [edit-distance]

A string metric describing the difference between two strings. More specifically, it is the minimum number of operations needed to transform one string into the other. Operations include the insertion, deletion, substitution, or transposition of a character in the string. Different variants allow different combinations of these operations, and each operation may carry its own cost.
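
As a rough illustration of the description above, here is a minimal sketch of the most common variant, the Levenshtein distance, which allows insertion, deletion, and substitution, each at an assumed cost of 1. The function name `levenshtein` is chosen for this example only and is not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the dynamic-programming table are kept, so memory use grows with the length of `b` rather than with the product of both lengths.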

References

Edit distance (Wikipedia)

256 questions
7
votes
2 answers

Oracle fuzzy text search with wildcards

I've got a SAP Oracle database full of customer data. In our custom CRM it is quite common to search for customers using wildcards. In addition to the SAP standard search, we would like to do some fuzzy text searching for names which are…
Florian
  • 5,918
  • 3
  • 47
  • 86
7
votes
1 answer

How is Levenshtein Distance calculated on Simplified Chinese characters?

I have 2 queries: query1:你好世界 query2:你好 When I run this code using the Python library Levenshtein: from Levenshtein import distance, hamming, median lev_edit_dist = distance(query1,query2) print lev_edit_dist I get an output of 12. Now…
jxn
  • 7,685
  • 28
  • 90
  • 172
6
votes
1 answer

Difference in normalization of Levenshtein (edit) distance?

If the Levenshtein distance between two strings, s and t, is given by L(s,t), what is the difference in the impact on the resulting heuristic of the following two different normalization approaches? L(s,t) / [length(s) + length(t)] L(s,t) /…
user2205916
  • 3,196
  • 11
  • 54
  • 82
6
votes
3 answers

Optimize R code to create distance matrix based on customized distance function

I am trying to create a distance matrix (to use for clustering) for strings based on a customized distance function. I ran the code on a list of 6000 words and it has been running for the last 90 minutes. I have 8 GB RAM and an Intel i5, so the problem is…
Gaurav Singhal
  • 998
  • 2
  • 10
  • 25
6
votes
3 answers

Generate regular expression for given string and edit distance

I have the problem that I want to match all strings in the database having a certain edit distance to a given string. My idea was to generate a regular expression that would match all strings with edit distance d to string s. So for example I want…
Martin Cup
  • 2,399
  • 1
  • 21
  • 32
6
votes
2 answers

most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings. Question: which of these measures would be most relevant to detect two different persons' names which are actually the same (different because of a misspelling)? The…
seinecle
  • 10,118
  • 14
  • 61
  • 120
5
votes
3 answers

Quickly compare a string against a Collection in Java

I am trying to calculate edit distances of a string against a collection to find the closest match. My current problem is that the collection is very large (about 25000 items), so I had to narrow down the set to just strings of similar lengths but…
Lezan
  • 667
  • 2
  • 7
  • 20
5
votes
2 answers

Edit distance with swaps

Edit distance finds the number of insertions, deletions, or substitutions required to transform one string into another. I want to also include swaps in this algorithm. For example, "apple" and "appel" should give an edit distance of 1.
Raja Roy
  • 75
  • 3
  • 4
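
For the "Edit distance with swaps" question above, one possible approach is the optimal string alignment (OSA) variant, which adds a swap of two adjacent characters as a fourth operation. The sketch below assumes every operation costs 1; the name `osa_distance` is hypothetical, and this is not necessarily the approach taken in the question's answers.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between the first i chars of a and first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions
    for j in range(cols):
        d[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent swap: the last two characters of each prefix are
            # the same pair in reversed order.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("apple", "appel"))  # 1
```

OSA never edits the same substring twice, so it can differ from the unrestricted Damerau-Levenshtein distance: for "ca" and "abc" it gives 3, where the unrestricted version gives 2.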
5
votes
1 answer

Efficiently calculate edit distance between two strings

I have a string S of length 1000 and a query string Q of length 100. I want to calculate the edit distance of query string Q with every substring of string S of length 100. One naive way to do this is to dynamically calculate the edit distance of every…
rombi
  • 199
  • 3
  • 22
5
votes
3 answers

Modify Levenshtein-Distance to ignore order

I'm looking to compute the Levenshtein distance between sequences containing up to 6 values. The order of these values should not affect the distance. How would I implement this in the iterative or recursive algorithm? Example: # Currently…
Luis
  • 85
  • 6
5
votes
3 answers

Wagner–Fischer algorithm

I'm trying to understand the Wagner–Fischer algorithm for finding the distance between two strings. I'm looking through the pseudocode for it at the following link: http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm int EditDistance(char…
mdavid
  • 563
  • 6
  • 20
5
votes
1 answer

Redshift: Any ways to compute fuzzy string similarity / string edit distance?

In PSQL (on which I believe Redshift is based), there are string similarity functions like levenshtein / levenshtein_less_equal [ http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html ]. These features don't seem to have made it into Redshift […
gatoatigrado
  • 16,580
  • 18
  • 81
  • 143
5
votes
1 answer

How to get an edit-distance between two commits?

I'm looking for a way to compute a good edit distance between the contents of any two commits. The best I've found is to derive something from the output of git diff --numstat ...but anything I can come up with using this…
kjo
  • 33,683
  • 52
  • 148
  • 265
5
votes
2 answers

Looking for similar words

I'm trying to write a spellchecker module. It loads a text, creates a dictionary from a 16 MB file, and then checks whether an encountered word is similar to a word in the dictionary (similar = varies by up to two chars). If so, it changes it to the form from…
Michal
  • 6,411
  • 6
  • 32
  • 45
4
votes
0 answers

string similarity of optimal alignment

Expected behaviour of the algorithm: I have two strings a and b, with a being the shorter string. I would like to find the substring of b that has the greatest similarity to a. The substring has to be of length len(a), or has to be placed at the end of…