Questions tagged [fuzzy-comparison]

Fuzzy comparison is the colloquial name for approximate string matching, the technique of finding strings that match a pattern approximately rather than exactly. The problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately.
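
As a rough illustration of the two sub-problems, here is a minimal sketch in plain Python. It assumes Levenshtein edit distance as the similarity measure, which is only one of several common choices. The function names (`levenshtein`, `dictionary_matches`, `substring_match_ends`) are made up for this example and do not come from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def dictionary_matches(pattern: str, words: list[str], max_distance: int) -> list[str]:
    """Sub-problem 2: dictionary words within max_distance edits of pattern."""
    return [w for w in words if levenshtein(pattern, w) <= max_distance]


def substring_match_ends(pattern: str, text: str, max_distance: int) -> list[int]:
    """Sub-problem 1: 1-based end positions in text where some substring
    matches pattern within max_distance edits. Same recurrence as above,
    except a match may start anywhere in text at no cost."""
    prev = list(range(len(pattern) + 1))  # pattern prefixes vs. empty text
    ends = []
    for j, ct in enumerate(text, start=1):
        curr = [0]  # an empty pattern prefix matches anywhere for free
        for i, cp in enumerate(pattern, start=1):
            cost = 0 if cp == ct else 1
            curr.append(min(prev[i] + 1,
                            curr[i - 1] + 1,
                            prev[i - 1] + cost))
        if curr[-1] <= max_distance:
            ends.append(j)
        prev = curr
    return ends


print(levenshtein("kitten", "sitting"))  # 3
print(dictionary_matches("aple", ["apple", "apples", "orange", "mango"], 1))
print(substring_match_ends("abcd", "xxabcdxx", 0))
```

Both functions run in time proportional to the product of the input lengths, so for large dictionaries or long texts a dedicated library or an index structure is usually a better fit than this sketch.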


361 questions
4
votes
2 answers

Java library for fuzzy comparing text strings

I'm looking for a tool that would compare two text strings and return a score indicating their similarity (e.g. 95%). It needs to be implemented on a platform supporting Java libraries. My best guess is that I need some fuzzy…
mikolajek
  • 91
  • 2
  • 9
4
votes
0 answers

How does agrep matching work?

The agrep function gives some puzzling results and I'd like to understand its behavior better. For example: agrep("abcd",c("abc","abcde","abcef"),value=T,max.distance = 1) Returns: [1] "abc" "abcde" "abcef" But the distance between "abcd" and…
xyy
  • 547
  • 1
  • 5
  • 12
4
votes
1 answer

Fuzzy logic on big datasets using Python

My team has been stuck running a fuzzy logic algorithm on two large datasets. The first (subset) is about 180K rows and contains names, addresses, and emails for the people that we need to match in the second (superset). The superset contains…
4
votes
1 answer

Similarity score of two lists with strings

I have a list of strings as a query and a few hundred other lists of strings. I want to compare the query with every other list and extract a similarity score between them. Example: query = ["football", "basketball", "martial arts",…
Tasos
  • 7,325
  • 18
  • 83
  • 176
4
votes
1 answer

Peculiar behaviour of Jaro Distance in JellyFish

I am trying to use Jellyfish to work with fuzzy strings. I am noticing some strange behaviour of the jaro_distance algorithm. I had some issues previously with the damerau_levenshtein_distance algorithm which appeared to be a bug in the code, which…
Woody Pride
  • 13,539
  • 9
  • 48
  • 62
4
votes
1 answer

Is Jellyfish's Damerau–Levenshtein distance calculation buggy?

I am trying to use Jellyfish to work with fuzzy strings. I am noticing some strange behaviour of the Damerau–Levenshtein distance algorithm. For example: import jellyfish as jf In [0]: jf.damerau_levenshtein_distance('ZX', 'XYZ') Out[0]: 3 In [1]:…
Woody Pride
  • 13,539
  • 9
  • 48
  • 62
4
votes
2 answers

Python "regex" module: Fuzziness value

I'm using the "fuzzy match" functionality of the Regex module. How can I get the "fuzziness value" of a "match" which indicates how different the pattern is to the string, just like the "edit distance" in Levenshtein? I thought I could get the…
tslmy
  • 628
  • 1
  • 6
  • 21
4
votes
3 answers

What is the best way to compare decimals?

What is the best way to compare decimals? Let's say I have 2 values, like 3.45 and 3.44. What is the best way to reliably compare them? I was thinking of storing all numbers as 345 and 344 so that I am comparing whole numbers only, and only show to…
sharp12345
  • 4,420
  • 3
  • 22
  • 38
3
votes
1 answer

OCR: Choose the best string based on last N results (an adaptive filter for OCR)

I've seen some questions on deciding the best OCR result given output from different engines, and the answer is typically "choose the best engine". I want, however, to capture several frames of text images, with possible temporary occlusions or…
jpimentel
  • 694
  • 1
  • 7
  • 23
3
votes
1 answer

How to normalize company names

We have user-generated names of employers that come in all variations. For example, people have typed in or imported: Google Google, Inc. Google Inc. Google inc To a database search, this looks like a different company altogether. We've changed…
user577808
  • 2,317
  • 4
  • 21
  • 31
3
votes
1 answer

Fuzzy Searching a Column in Pandas

Is there a way to search for a value in a dataframe column using FuzzyWuzzy or a similar library? I'm trying to find a value in one column that corresponds to the value in another while taking fuzzy matching into account. So for example, if I have…
3
votes
0 answers

How do I implement a custom comparator in the Python Dedupe library?

I'm using the so-far great Dedupe library to help link records from multiple providers. One of the fields I'm comparing is a phone number field. I'd like to use Google's phone number library to normalize these phone numbers. One other nice…
3
votes
1 answer

A more accurate and more efficient fuzzy searching algorithm

I have been researching fuzzy match / search algorithms across the internet. I have tried a couple of solutions. The only one that gave somewhat accurate results was from Mr. Excel (http://www.mrexcel.com/pc07.shtml). The problem with this method is the…
3
votes
1 answer

Quicker way to perform fuzzy string match in pandas

Is there any way to speed up fuzzy string matching using fuzzywuzzy in pandas? I have a dataframe, extra_names, which has names that I want to fuzzy match against another dataframe, names_df. >> extra_names.head() not_matching 0 Vij…
Aman Singh
  • 1,111
  • 3
  • 17
  • 31
3
votes
1 answer

Fuzzy match strings in one column and create new dataframe using fuzzywuzzy

I have the following dataframe: df = pd.DataFrame( {'id': [1, 2, 3, 4, 5, 6], 'fruits': ['apple', 'apples', 'orange', 'apple tree', 'oranges', 'mango'] }) id fruits 0 1 apple 1 2 apples 2 3 orange 3 4 …
ah bon
  • 9,293
  • 12
  • 65
  • 148