Questions tagged [fuzzywuzzy]

FuzzyWuzzy is a Python package to perform fuzzy string matching.

FuzzyWuzzy is a Python package to perform fuzzy string matching.

Useful links

522 questions
3
votes
1 answer

Pyspark levenshtein Join error

I want to perform a join based on Levenshtein distance. I have 2 tables: Data: Which is a CSV in HDFS file repository. one of the columns is Disease description, 15K rows. df7_ct_map: a table I call from Hive. one of the columns is Disease…
Lizou
  • 863
  • 1
  • 11
  • 16
3
votes
2 answers

PySpark throws ImportError, but Module actually exists and works well

I am using Cloudera and the Spark version is 2.1.0. I was trying to crossJoin two tables and create a column with fuzzy match ratio (thus I need to import fuzzywuzzy). Here is the code: from fuzzywuzzy import fuzz def fuzzy_ratio(x,y): from…
3
votes
1 answer

Error while running fuzzywuzzy/fuzz.py

I have a program that uses fuzzywuzzy to match to csvs and find any strings that might be duplicates or very similar. When I compare my two files, fuzzywuzzy raises the following error: WARNING:root:Applied processor reduces input query to empty…
Ecanales
  • 81
  • 1
  • 8
3
votes
1 answer

Python's fuzzywuzzy returns unpredictable results

I'm working with fuzzy wuzzy in python and while it claims it works with a levenshtein distance, I find that many strings with a single character different produce different results. For…
3
votes
2 answers

How to fuzzy match movie titles with difflib and pandas?

I have 2 lists of potentially overlapping movie titles, but possibly written in a different form. They are in 2 different dataframes from pandas. So I have tried to use the map() function with the fuzzywuzzy library like so: df1.title.map(lambda x:…
Bastian
  • 5,625
  • 10
  • 44
  • 68
3
votes
1 answer

Pandas and Fuzzy Match

Currently I have two data frames. I am trying to get a fuzzy match of client names using fuzzywuzzy's process.extractOne function. When I have run the following script on sample data I get good results and no error, but when I run the following on…
gatherer
  • 43
  • 9
3
votes
2 answers

Python fuzzywuzzy error string or buffer expect

I'm using fuzzywuzzy to find near matches in a csv of company names. I'm comparing manually matched strings with the unmatched strings in the hope of finding some useful proximity matches, however, I'm getting a string or buffer error within…
woodbine
  • 553
  • 6
  • 26
3
votes
1 answer

Import Error: No module named 'utils'

Please excuse me I'm a newbie. I'm trying to use the fuzzywuzzy module from seatgeek. I am using Python 3 Initially, I was getting this error: from fuzzywuzzy import fuzz ImportError: cannot import name fuzz I changed the import statement to…
shoi
  • 167
  • 1
  • 3
  • 7
2
votes
1 answer

TheFuzz Library in Python - Ratio Function

I'm trying to understand how the ratio is calculated in this function. I've been searching all over the internet and even asking ChatGPT about it, but I can't find a single answer. The issue is that, for example: from thefuzz import fuzz #fuzzywuzzy…
2
votes
3 answers

Python Speeding Up a Loop for text comparisons

I have a loop which is comparing street addresses. It then uses fuzzy matching to tokenise the addresses and compare the addresses. I have tried this both with fuzzywuzzy and rapidfuzz. It subsequently returns how close the match is. The aim is to…
John Smith
  • 2,448
  • 7
  • 54
  • 78
2
votes
3 answers

How to find the longest common substring in a list of strings (>2 strings)? Trying FuzzyWuzzy and Sequence matcher

So I am trying to find a common identifier for journals using dois. For example, I have a list of dois for a…
msci
  • 29
  • 2
2
votes
1 answer

is there a way to clean text with typos in a column to make these text items the same as in reference column which contain the same items?

I have two excel files each file contains columns [product name, price, quantity]. one is the reference table (drug_list) which contain correct (product name) items (14000 item) however the other file (order-1) contains (product name) items (3000…
2
votes
1 answer

Replacing similar strings in the column by using the same for both

I'm encountering the following issue during a small project of mine. I'm having a large dataset where some string values are accidentally not written properly. My goal is to write a function that ensures that all names that look fairly similar (.75)…
DataDude
  • 136
  • 7
2
votes
1 answer

How to set a column value by fuzzy string matching with another dataframe?

I have referred to this post but cannot get it to run for my particular case. I have two dataframes: import pandas as pd df1 = pd.DataFrame( { "ein": {0: 1001, 1: 1500, 2: 3000}, "ein_name": {0: "H for Humanity", 1: "Labor…
2
votes
0 answers

How to install python-Levenshtein on Windows? Getting an error when using fuzzywuzzy?

I'm writing a program that returns a list of tuples, each tuple consisting of closely matching strings present in two lists A and B using the python module fuzzywuzzy. However, on running this code, I'm getting this error: UserWarning: Using slow…
user13987693