Questions tagged [levenshtein-distance]

A metric for measuring the amount of difference between two sequences. The Levenshtein distance allows deletion, insertion and substitution.

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance is one specific member of the family of edit distances.
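As a rough illustration of the definition above, here is a minimal sketch of the usual dynamic-programming approach in Python. The function name `levenshtein` and the two-row layout are choices made for this example, not something prescribed by the tag description; it assumes unit cost for every insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string on the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the first i-1 chars of a and first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into the empty prefix costs i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use is proportional to the shorter string while running time stays proportional to the product of the two lengths.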


967 questions
0 votes, 2 answers

Levenshtein distance. Max distance exception

I have this Levenshtein algorithm: public static int? GetLevenshteinDistance(string input, string output, int maxDistance) { var stringOne = String.Empty; var stringTwo = String.Empty; if (input.Length >=…
user3499880
0 votes, 0 answers

Using Hamming distance to group a list of DNA sequences into similar sequences (Python)?

I've extracted a list of DNA sequences from a FASTQ file and it's currently stored in a list for now. sequences = ['ATCT','ATTT','ACGG','ACCG','ACGT','AGGT','ATGC','ATCC','AGTT'] I want to cluster sequences into a list of tuples so that each…
0 votes, 2 answers

Match items from two sets of data by highest % of similarities

Task: I have two columns with product names. I need to find the most similar cell from Column B for Cell A1, then for A2, A3 and so on. Input: Col A | Col B ------------- Red  | Blackwell Black | Purple   White | Whitewater  Green |…
0 votes, 1 answer

Levenshtein distance

I would like to know if there is any algorithm that returns the number of insertions, deletions and substitutions between two words. Most algorithms only return an integer with the distance between the two words but I would like to also have how…
m4sh4
0 votes, 1 answer

Levenshtein distance Python UDF as fuzzy matching proxy in SQL join

I came across a forum post that describes a method of creating a Python UDF in Redshift: https://community.periscopedata.com/r/y715m2. More info about Python UDFs in Redshift:…
user8834780
0 votes, 1 answer

Which algorithm to match most similar string from a set?

Let's say I have a database of books that includes their titles. For a given listing from eBay or Craigslist or some other such site, I want to compare its title string to all of the book titles in my database to try to find a match. It's unlikely…
user457586
0 votes, 1 answer

vectorized text mining over multiple columns

I have some code that I would like to vectorize but I am not sure how. The following code gives some example data, consisting of names and addresses. name <- c("holiday inn", "geico", "zgf", "morton phillips") address <- c("400 lafayette pl tupelo…
jvalenti
0 votes, 1 answer

How to compute multiple related Levenshtein distances?

Given two strings of equal length, Levenshtein distance lets you find the minimum number of transformations necessary to get the second string, given the first. However, I'd like to find a way to adjust the algorithm for multiple pairs of strings,…
user490735
0 votes, 1 answer

Python - Assign the closest string from List A to List B based on Levenshtein distance - (ideally with pandas)

By way of introduction, I am pretty new to Python; I mainly know how to use pandas for data analysis. I currently have 2 lists of 100+ entries, "Keywords" and "Groups". I would like to generate an output (ideally a dataframe in pandas), where for…
0 votes, 0 answers

Python with mongodb

I am a complete beginner with Python. I have two tables, Table A and Table B. Table A has 1M records and Table B has 14M records, and each record is a very long sentence (paragraph) with special characters, numbers, etc. I…
ramki
0 votes, 1 answer

Fuzzy string search in array with postgresql

This is how I do fuzzy string search in PostgreSQL: select * from table where levenshtein(name, 'value') < 2; But what can I do if the 'name' column contains an array? P.S.: It is necessary to use an index. And this is the difference.
kz_sergey
0 votes, 1 answer

Levenshtein implementation in Clojure with memoization

This is a minimal Levenshtein (edit distance) implementation using Clojure with recursion: (defn levenshtein [s1, i1, s2, i2] (cond (= 0 i1) i2 (= 0 i2) i1 :else (min (+ (levenshtein s1 (- i1 1) s2 i2) 1) (+ (levenshtein…
gil.fernandes
0 votes, 2 answers

pandas algorithm slow: for loops and lambda

Summary: I am searching for misspellings across a bunch of data and it is taking forever. I am iterating through a few CSV files (a million lines total?), and in each I am iterating through a JSON sub-value that has maybe 200 strings to search for. For…
jleatham
0 votes, 0 answers

How can I visualize all changes in one string compared to another?

Currently, I use https://jsfiddle.net/MartinThoma/h9kL6zox/1/ (see this answer) to highlight changes from one string (< 255 chars) to another string (<255 chars). I can only add highlighting code to one of them. There are three types of changes…
Martin Thoma
0 votes, 1 answer

Python/Pandas - String Comparisons

I have a list of strings/narratives which I need to compare and get a distance measure between each string. The current code I have written works, but for larger lists it takes a long time since I use 2 for loops. I have used the Levenshtein distance…
Bryce Ramgovind