Questions tagged [fuzzy-comparison]


Fuzzy comparison is the colloquial name for approximate string matching: the technique of finding strings that match a pattern approximately rather than exactly. The problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that approximately match the pattern.
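As a rough illustration of the two sub-problems described above, here is a minimal Python sketch. The function names `best_substring_distance` and `closest_dictionary_words` are hypothetical, chosen only for this example. The substring part uses a plain edit-distance dynamic program with a zero first row (so a match may start anywhere in the text), and the dictionary part simply wraps the standard library's `difflib.get_close_matches`. Other distance measures or libraries may suit a given question better.

```python
import difflib


def best_substring_distance(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`.

    Edit-distance dynamic programming, except that the first row is all
    zeros, so an approximate match may start at any position in `text`.
    """
    previous = [0] * (len(text) + 1)  # empty pattern matches anywhere at cost 0
    for i, p_char in enumerate(pattern, start=1):
        current = [i]  # i pattern characters against an empty substring
        for j, t_char in enumerate(text, start=1):
            cost = 0 if p_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # drop p_char from the pattern
                current[j - 1] + 1,      # absorb an extra text character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return min(previous)  # the match may end at any position in `text`


def closest_dictionary_words(pattern, words, cutoff=0.6):
    """Dictionary words similar to `pattern` (difflib ratio >= cutoff)."""
    return difflib.get_close_matches(pattern, words, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(best_substring_distance("aple", "green apple pie"))
    print(closest_dictionary_words("aple", ["apple", "maple", "jimbo"]))
```

The substring search runs in time proportional to `len(pattern) * len(text)`, which is fine for short strings but is usually the first thing to optimise on large inputs.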



361 questions
7 votes, 1 answer

Generate "fuzzy" difference of two files in Python, with approximate comparison of floats

I have an issue comparing two files. Basically, what I want to do is a UNIX-like diff between two files, for example: $ diff -u left-file right-file However, my two files contain floats; and because these files were generated on distinct…
7 votes, 6 answers

Better fuzzy matching performance?

I'm currently using the get_close_matches method from difflib to iterate through a list of 15,000 strings to get the closest match against another list of approx 15,000 strings: a=['blah','pie','apple'...] b=['jimbo','zomg','pie'...] for value…
7 votes, 4 answers

q-gram approximate matching optimisations

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking to this, but search performance is not great on this data volume (around…
Peter
6 votes, 1 answer

Merge dataframes on multiple columns with fuzzy match in Python

I have two example dataframes as follows: df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'}, 'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'}, 'Age': {0: 27, 1: 23, 2: 21}}) df2 =…
ah bon
6 votes, 1 answer

Need more understanding on python fuzz partial ratio

I am using Python fuzzywuzzy at an enterprise level to match 2 strings. It works fine in most cases but gives unexpected results in the scenario below: fuzz.partial_ratio('ja rule:mesmerize','ja rule feat. ashanti:mesmerize') gives…
Sains
6 votes, 1 answer

Fuzzy Address matching R

Yeah, it's been asked before, but I can't find a thread that provides a simple, clean answer to this question. I have example data below - I have two columns, col1 is the current address, col2 is an address I am told is 'better' than the current…
Adam_S
6 votes, 2 answers

Fuzzy matching/chunking algorithm

Background: I have video clips and audio tracks that I want to sync with said videos. From the video clips, I'll extract a reference audio track. I also have another track that I want to synchronize with the reference track. The desync comes from…
Confluence
6 votes, 4 answers

SQL Fuzzy Join - MSSQL

I have two sets of data. Existing customers and potential customers. My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are…
hansolo
6 votes, 1 answer

The best way to search millions of fuzzy hashes

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like…
retrography
6 votes, 3 answers

Pandas fuzzy merge/match name column, with duplicates

I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, and if so, copy some of that information into my fundraiser dataset (donor name, email and their first donation).…
Wizuriel
5 votes, 3 answers

approximate RegEx in python with TRE: strange unicode behavior

I am trying to use the TRE library in Python to match misspelled input. It is important that it handles UTF-8 encoded strings well. An example: The German capital's name is Berlin, but from the pronunciation it is the same, if people would…
vikingosegundo
5 votes, 3 answers

Fuzzy Match columns of Different Dataframe

Background: I have 2 data frames that have no common key on which I can merge them. Both data frames have a column that contains "entity name". One contains 8000+ entities and the other close to 2000 entities. Sample Data: vendor_df= Name of Vendor …
Rahul Agarwal
5 votes, 2 answers

difflib on Ruby

Is there a library similar to Python's difflib on Ruby? Particularly, I need one that has a method similar to difflib.get_close_matches. Any recommendations?
fjsj
5 votes, 1 answer

Fuzzy Wuzzy String Matching on 2 Large Data Sets Based on a Condition - python

I have 2 large data sets that I have read into Pandas DataFrames (~20K rows and ~40K rows respectively). When I try merging these two DFs outright using pandas.merge on the address field, I get a paltry number of matches compared to the number of…
Nirav
5 votes, 2 answers

Best machine learning approach to automate text/fuzzy matching

I'm reasonably new to machine learning, though I've done a few projects in Python. I'm looking for advice on how to approach the problem below, which I believe could be automated. A user in a data quality team in my organisation has a daily task of taking a…