Questions tagged [levenshtein-distance]

A metric for measuring the amount of difference between two sequences. The Levenshtein distance allows deletion, insertion and substitution.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance is a specific algorithm of edit distance algorithms.

References:
Wikipedia
RosettaCode
Edit Distance (Wikipedia)
Hirschberg's algorithm (Wikipedia)

967 questions
0
votes
1 answer

Filter duplicates from list based on Levenshtein distance

Let's say that I have a list of JSONs like in the example. Of those that have duplicate title attribute (as determined by scoring over a certain threshold of Levenshtein distance), I'd like to filter out the duplicates that do not have the minimum…
Chris C
  • 599
  • 2
  • 8
  • 19
0
votes
1 answer

Is this Levenshtein Distance algorithm correct?

I have written the algorithm below to compute the Levenshtein distance, and it seems to return the correct results based on my tests. The time complexity is O(n+m), and the space is O(1). All the existing algorithms I've seen only for this have…
dwally89
  • 51
  • 2
  • 7
0
votes
1 answer

R - return n matches via levenshtein distance

I would like to find the n best matches to a given string via levenshtein distance. I know that the adist function in R gives the minimal distance, but I am attempting to scale the number of results to, say, 10. I have some code below. name <-…
jvalenti
  • 604
  • 1
  • 9
  • 31
0
votes
2 answers

Getting the closest string match (with possibly very different string sizes)

I am looking for a way to find the closest string match between two strings that could eventually have a very different size. Let's say I have, on the one hand, a list of possible locations like: Yosemite National Park Yosemite Valley Yosemite…
0
votes
1 answer

Is specific Lucene classes are intended to be consumed by applications?

I'm new to the Apache Lucene library. I'd like to directly consume a class in this library called: LevenshteinDistance to calculated similarity search between strings. Would that be correct for my own application to directly consume it, or should I…
aQ123
  • 560
  • 1
  • 8
  • 19
0
votes
1 answer

Why Levenshtein distance from "abcd" to "badc" is 3?

Starting from "abcd", I want to go to "badc" : I remove "a" -> "bcd"; I insert "a" at the right position -> "bacd"; I remove "c" -> "bad"; I insert "c" at the right position -> "badc". It's 4 operations. I can't find out a shorter way to do it.…
jules
  • 73
  • 2
  • 8
0
votes
0 answers

Scikit Learn for clustering mixed data (numeric & categorical)

Can someone please help modify the working example below to create clusters from the shared data? The example uses Mean Shift clustering from Scikit-Learn to identify patches of similar/co-located plant species in an agronomical facility. Similar…
0
votes
0 answers

Is there a way to filter what the python-levenshtein extension changes?

Ive got a large list of names (strings) that I have to check against each other to see if there are any typos. To do this I've been using the pypi python-Levenshtein extension against the iterated list, with a typo being considered as a comparison…
AngusOld
  • 43
  • 1
  • 7
0
votes
1 answer

Creating Multiple Distance Matrix using Levensthein Algorithm on C++

I am developing a programme about string analyses. I've done levensthein distance algorithm but it is just applicable for two strings. Now, I want to run more strings than two, concurrently. Like, from text file; S1 : "0123412315", S2 :…
Ugur Ç.
  • 3
  • 6
0
votes
1 answer

php find correct closest word

as we khow, we can find closest words by levenshtein for example:
DolDurma
  • 15,753
  • 51
  • 198
  • 377
0
votes
3 answers

fuzzywuzzy to normalize string in pandas column

I have a dataframe like this now i want to normalize the string in the 'comments' column for the word 'election' . I tried using fuzzywuzzy but wasn't able to implement it on pandas dataframe to partially match the word 'election'. The output…
nOObda
  • 123
  • 1
  • 2
  • 9
0
votes
1 answer

string matching irrespective of order of words and short forms in R : fuzzy string matching in R

I am new to R, and want to compare 2 strings(addresses) where Word order could be different, other than numbers. (Consecutive numbers need to be in same order) Words could be at times in short form, eg street could be st., North West could be…
Aditya Kuls
  • 115
  • 1
  • 7
0
votes
1 answer

Javascript Version of (MDLD) Modified Damerau-Levenshtein Distance Algorithm

I was looking to test the performance of MDLD for some in-browser string comparisions to be integrated into a web-app. The use-case involves comparing strings like, "300mm, Packed Wall" and "Packed Wall - 300mm", so I was looking for fuzzy string…
0
votes
0 answers

Best algorithm suited for finding most similar town

I have a dataset full of towns and their geographical data and some input data. This input data is, almost always, a town as well. But it being towns scraped off of the internet, they can be a little bit misspelled or spelled differently. e.g. Saint…
0
votes
1 answer

match two datasets with record linkage in R

I am trying to match two datasets in R: datasetA and datasetB. These datasets contain the following columns. datasetA ID: 15 Name: peter sanders First_Name: peter Last_Name: sanders ORG_NAME:coffee&cake City: New York Amount(USD): 10369…