Questions tagged [fuzzywuzzy]

FuzzyWuzzy is a Python package to perform fuzzy string matching.

522 questions
2 votes, 0 answers

How to efficiently get the top 3 similar strings in a distance matrix using only one triangle section?

Consider the following Python code: from scipy.spatial.distance import pdist, squareform from fuzzywuzzy import fuzz import pandas as pd words = pd.DataFrame({'Words': ['horse', 'dog', 'food', 'hhorse', 'doggy']}) distance_matr = pdist(words,…
Snowflake
2 votes, 1 answer

Python: fuzzywuzzy, the output of the first value is correct, the others are NaN

I'm stuck on a very strange problem: I have two dfs and I have to match the strings of one df with the strings of the other df, by similarity. The target column is the name of the television program (program_name_1 & program_name_2). In order to let him…
Laga
2 votes, 2 answers

How to combine the fuzzy function with apply(lambda x: ) function?

I have 2 dataframes df1 and df2 like this: df1: Id Name 1 Tuy Hòa 2 Kiến thụy 3 Bình Tân df2: code name A1 Tuy Hoà A2 Kiến Thụy A3 Tân Bình Now when I use merge: out_df = pd.merge(df1, df2,…
Tung Nguyen
2 votes, 2 answers

Get most similar value from dataframe column to specific string python

I want to find the most similar value from a dataframe column to a specified string, e.g. a='book'. Let's say the dataframe looks like: df col1 wijk 00 book Wijk a test Now I want to return wijk 00 book since this is the most similar to a. I am…
baqm
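
For readers skimming this question, here is a minimal sketch of picking the closest value from a list of candidates. It is not the asker's code and it does not use fuzzywuzzy: it swaps in the standard library's `difflib.SequenceMatcher` so it runs without extra packages. The helper name `most_similar` is hypothetical, the comparison is lowercased as an assumption about what "similar" should mean here, and the candidate list stands in for the `col1` column from the excerpt.

```python
from difflib import SequenceMatcher


def most_similar(target, candidates):
    """Return the candidate with the highest difflib similarity to target.

    Hypothetical helper: compares case-insensitively and uses
    SequenceMatcher.ratio() as the score instead of a fuzzywuzzy scorer.
    """
    def score(candidate):
        # ratio() is a float in [0, 1]; higher means more similar.
        return SequenceMatcher(None, target.lower(), candidate.lower()).ratio()

    return max(candidates, key=score)


# Values taken from the question excerpt; in pandas this could be df["col1"].tolist().
col1 = ["wijk 00 book", "Wijk a test"]
best = most_similar("book", col1)
```

With fuzzywuzzy installed, `process.extractOne` plays a similar role, though its default scorer differs from the plain ratio used in this sketch.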
2 votes, 1 answer

pandas: calculate fuzzywuzzy for each category separately

I have a dataset as follows, only with more rows: import pandas as pd data = {'First': ['First value','Third value','Second value','First value','Third value','Second value'], 'Second': ['the old man is here','the young girl is there', 'the old…
zara kolagar
2 votes, 1 answer

How do I get additional column name information in a pandas group by / nlargest calculation?

I am comparing pairs of strings using six fuzzywuzzy ratios, and I need to output the top three scores for each pair. This line does the job: final2_df = final_df[['nameHiringOrganization', 'mesure', 'name',…
davidv
2 votes, 1 answer

Python FuzzyWuzzy ratio: how does it work?

Inside the FuzzyWuzzy ratio description it says: The FuzzyWuzzy ratio raw score is a measure of the strings similarity as an int in the range [0, 100]. For two strings X and Y, the score is defined by int(round((2.0 * M / T) * 100)) where T is the…
s900n
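
The formula quoted in this question can be tried out with a short sketch. It assumes fuzzywuzzy's pure-Python fallback, which relies on `difflib.SequenceMatcher` (results can differ slightly when python-Levenshtein is installed). The excerpt is cut off before it defines T, so the definitions here are taken from Python's `difflib` documentation rather than from the question: T is the total number of characters in both strings and M is the number of matched characters. The function name `ratio_score` is hypothetical.

```python
from difflib import SequenceMatcher


def ratio_score(x, y):
    """Compute int(round((2.0 * M / T) * 100)) from difflib matching blocks."""
    matcher = SequenceMatcher(None, x, y)
    # M: characters covered by the matching blocks difflib finds.
    m = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings (per the difflib docs).
    t = len(x) + len(y)
    if t == 0:
        # difflib treats two empty sequences as identical.
        return 100
    return int(round((2.0 * m / t) * 100))


# "horse" (5 chars) vs "hhorse" (6 chars): M = 5, T = 11.
score = ratio_score("horse", "hhorse")
```

Here `score` is 91, since 2 * 5 / 11 is about 0.909.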
2 votes, 1 answer

Group by fuzzy string matches with fuzzywuzzy and groupby

I have a dataset of random words and names and I am trying to group all of the similar words and names. So given the dataframe below: Name ID Value 0 James 1 10 1 James 2 2 …
DrakeMurdoch
2 votes, 1 answer

python fuzzywuzzy fuzzy matching - exclude terms

I am fairly new to Python and have been using fuzzywuzzy to do some fuzzy matching with success. I am wondering, however, if there is a way to exclude terms from the algorithm? Generic terms can often be matched to a ton of options, and I would like to…
2 votes, 1 answer

Basic question - iterating through pandas dataframe column using a function

I am struggling with the basics. I have just one column with names in a pandas dataframe and I want to compare strings for potential duplicates using 3-4 functions from the fuzzywuzzy library. So the first name I want to check against the rest of the column…
cnns
2 votes, 1 answer

Using fuzzy wuzzy to match names (Issue!) Not performing as expected?

I want to match names appropriately, but as can be seen below, the result is not the match I wanted. Is there any way to get around this? I just want Mr Mark Longfield to be preferred over Mr Laurence Boode, as it is more likely to be the correct match. from…
user11357465
2 votes, 1 answer

Fuzzy match columns and merge/join dataframes

I am trying to merge 2 dataframes with multiple columns each, based on matching values in one column of each. This code from @Erfan does a great job fuzzy matching the target columns, but is there a way to carry the rest of columns…
pyproper
2 votes, 1 answer

How to compare row by row in a dataframe

I have a data frame that has a name and the URL ID of the name. For example: Abc 123 Abc.com 123 Def 345 Pqr 123 PQR.com 123 Here, due to a data extraction error, different names sometimes have the same ID. I want…
asspsss
2 votes, 1 answer

Fuzzy matching from string candidate list

I've got a list of company names that I am trying to parse from a large number of PDF documents. I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in. I'm stuck trying to use some…
Jack McPherson
2 votes, 1 answer

fuzzy duplicate check using python dedupe library error

I'm trying to use the Python dedupe library to perform a fuzzy duplicate check on my mock data, but I keep getting this error: {'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'}, 'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'}, 'Invoice Date':…
python_rok