Questions tagged [levenshtein-distance]

A metric for measuring the amount of difference between two sequences. The Levenshtein distance allows deletion, insertion and substitution.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance is a specific algorithm of edit distance algorithms.

References:
Wikipedia
RosettaCode
Edit Distance (Wikipedia)
Hirschberg's algorithm (Wikipedia)

967 questions
0
votes
0 answers

Fuzzysearch: Looping over strings and keep the most similar

I have a number of correctly spelled names in a list L_name = ['Julia', 'John', 'James', 'Jay', 'Jordan'] And I have a list of results from a form where people entered their name L_entries = ['Julie', 'John', 'Jo', 'Jamie', 'Jamy', 'James', 'Jay',…
user9092346
  • 292
  • 2
  • 11
0
votes
1 answer

String matching function between two columns using Levenshtein distance in PySpark

I am trying to compare pairs of names by converting the levenshtein distance between them to a matching coef such as : coef = 1 - Levenstein(str1, str2) / max(length(str1) , length(str2)) However, when I implement it in PySpark using withColumn(),…
0
votes
1 answer

Negative array index in a pseudocode

Consider this sample of a pseudocode for Levenshtein Distance: function Leven_dyn(S1[1..m], S2[1..n], m, n); var D[0..m, 0..n]; begin for i := 0 to m do D[i, 0] := i; end for for j := 0 to n do D[0, j] := j; end for for i := 0 to m…
0
votes
1 answer

How to compute word per token word distance and return the count of 0 distance in a column

I got two descriptions, one in a dataframe and other that is a list of words and I need to compute the levensthein distance of each word in the description against each word in the list and return the count of the result of the levensthein distance…
0
votes
1 answer

Quick search for a start location of a given string

Here, I would like to match a given string match_text to a longer string text. I want to find match_text's start location in text, the closest one (you can assume that there is only one location). My current version of the code is to for loop…
titipata
  • 5,321
  • 3
  • 35
  • 59
0
votes
1 answer

I've been trying to match the a name received from 2 sources with each other and check if they are almost a match or not

In the sample data, I've listed the names of employers of a particular person(a prospective customer) which we received from 2 different sources. I've been trying to find a way to better match the two names and get good results. (Currently, it's…
0
votes
0 answers

Is there a way to measure the similarity of a group of strings? A kind of 'mass' Levenshtein distance?

Levenshtein distance is a gauge of the distance between two strings. Is there a similar metric for assessing how similar a group of strings are to each other, a kind of mean or group distance? EDIT: I realize now that my question above could be…
Chris T
  • 453
  • 1
  • 6
  • 17
0
votes
1 answer

ImportError: cannot import name 'ratio' from 'Levenshtein' (unknown location)

When I try import fuzzymatcher I get the error: ImportError: cannot import name 'ratio' from 'Levenshtein' (unknown location)
Ee Ann Ng
  • 109
  • 1
  • 8
0
votes
1 answer

More efficient way to compare the contents of thousands of text files

I have around 10,000 text files and quite a lot of them have very similar content. I am trying to get rid of the files that are very similar to each other so that I am left with a smaller and more unique set. Just for reference the contents of the…
0
votes
0 answers

Fuzzy Lookup - Levenshtein with Syllable and Substring Match

What i tried to do is, create a fuzzy lookup algortihm that helps us to match these unique records with our mapping table by and shows us the percentage. I am applying Levenshtein algorithm to find the matches which is well known algorithm for fuzzy…
0
votes
2 answers

similarity score between phrases

Levenshtein distance is an approach for measuring the difference between words, but not so for phrases. Is there a good distance metric for measuring differences between phrases? For example, if phrase 1 is made of n words x1 x2 x_n, and phrase 2 is…
user1424739
  • 11,937
  • 17
  • 63
  • 152
0
votes
1 answer

clustering set of string sentences into unknown number of groups

I have a set of sentences (each sentence = x number of rows where x belongs to range (1,6)). I want to group these sentences based on the similarities between them. I have tried fuzzy wuzzy.token_set_ration but the trouble I have is that I need to…
0
votes
1 answer

How to call Levenshtien Function using the values from two different tables in T-SQL

I am trying to find the Levenshtien distance between the columns of two different tables TableA and TableB. Basically I need to match ColumnA of TableA with all the elements of ColumnB in TableB and find the Levenshtien Distance I have created a…
user5593950
0
votes
1 answer

Finding words in long string within edit distance ignoring whitespace

I am looking for a algorithm to efficiently search for words within given edit distance in a query string while ignoring whitespace. For e.g. If words on which I need to build an index are: OHIO, WELL and query String: HELLO HI THERE H E L L O…
0
votes
1 answer

identifier splitting to approximately match documentation

Different software projects have different coding convention; even in the same project there may be different languages used and will have different convention. What is good for searching documentation (which appear outside the source files), with…
Tathagata
  • 1,955
  • 4
  • 23
  • 39