Questions tagged [fuzzywuzzy]

FuzzyWuzzy is a Python package to perform fuzzy string matching.
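
As a rough illustration of what fuzzy string matching means, the sketch below scores string similarity on a 0-100 scale using the standard library's `difflib.SequenceMatcher`. It is a stand-in written for this page, not FuzzyWuzzy's own API, and the helper names `similarity` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two strings (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(query: str, choices: list[str]) -> tuple[str, int]:
    """Return the (choice, score) pair with the highest score for the query."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    words = ["dog", "cat", "mate", "mouse", "zebra", "lion"]
    # A misspelled input is matched to its closest candidate.
    print(best_match("kat", words))
```

Scoring every query against every choice is quadratic in the worst case, which is why many of the questions under this tag are about speed on large datasets.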

522 questions
2
votes
1 answer

How to do a fuzzy match in PySpark UDF?

I am trying to run the following code to generate an additional column in a PySpark DataFrame. The idea is to take a column from the DataFrame and get the maximum score by comparing it against a list of keywords I have (e.g. choices). def get_max_sore(col): …
2
votes
4 answers

By how much percentage do the two strings match?

I have two columns of disease names and need to find the best matches between them. I tried the "SequenceMatcher" class and the "fuzzywuzzy" module in Python, and the results were surprising. I have pasted the results and my doubts below: Consider there is a…
2
votes
1 answer

Python multiprocessing not working as intended with fuzzywuzzy

Either my processes start only after the previous one finishes, or they start simultaneously but without calling the target function. I have tried many variants, but it does not behave the way the tutorials describe. My goal is to use fuzzywuzzy to string match an 80k…
2
votes
3 answers

How to compare a value in one dataframe to a column in another using fuzzywuzzy ratio

I have a dataframe df_sample with 10 parsed addresses and am comparing it to another dataframe, df, with hundreds of thousands of parsed address records. Both df_sample and df share the exact same structure: zip_code city state …
DrakeMurdoch
2
votes
2 answers

How to Find Company Names in Text Using Python

I have a list of properly-formatted company names, and I am trying to find when those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as…
user53526356
2
votes
0 answers

Faster way to calculate a distance matrix

I have a dataframe import numpy as np from fuzzywuzzy import fuzz from fuzzywuzzy import process import pandas as pd a = {'b':['cat','bat','cat','cat','bat','No Data','bat','No Data']} df11 =…
Vas
2
votes
2 answers

How to compare strings more efficiently when using fuzzywuzzy?

I have a CSV file with ~20000 words and I'd like to group the words by similarity. To accomplish this, I am using the fantastic fuzzywuzzy package, which seems to work really well and achieves exactly what I am looking for with a small dataset…
Dalvtor
2
votes
2 answers

Rearrange words using Levenshtein distance

Summary: I am trying to compute a name-matching percentage in PHP, but before that I need to rearrange the words in a string according to the first string. What is the source code about? I have two strings. First I add both strings to an array if a space is found…
user10655999
2
votes
1 answer

Apply fuzzy matching score at two columns of a dataframe

I have a dataframe: df = original_title title Mexico Oil Gas Summit Mexico Oil Gas Summit I have to fuzzy match the entities of these two (original_title & title) columns and…
SaNa
2
votes
1 answer

Python Record Linkage, Fuzzy Match and Deduplication

I have 3 datasets of customers with 7 columns: CustomerName, Address, Phone, StoreName, Mobile, Longitude, Latitude. Every dataset has 13000-18000 records. I am trying to fuzzy match for deduplication between them. My dataset columns don't have the same…
Dr Sima
2
votes
1 answer

Most Likely Word Based on Max Levenshtein Distance

I have a list of words: lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion'] I also have a pandas dataframe: df = pd.DataFrame({'input': ['dog', 'kat', 'leon', 'moues'], 'suggested_class': ['a', 'a', 'a', 'a']}) input suggested_class dog …
Luxo_Jr
2
votes
2 answers

How can I check a Pandas DataFrame's column against itself?

I have a Pandas DataFrame with two relevant columns. I need to check column A (a list of names) against itself, and if two (or more) values are similar enough to each other, I sum the values in column B for those rows. To check similarity, I'm using…
phltop
2
votes
1 answer

Python Pandas Column and Fuzzy Match + Replace

Intro Hello, I'm working on a project that requires me to replace dictionary keys within a pandas column of text with values - but with potential misspellings. Specifically I am matching names within a pandas column of text and replacing them with…
2
votes
2 answers

python3 fuzzywuzzy not returning index value of the array

I am trying to modify the fuzzywuzzy library. The process module returns the score and the array element, but I want it to also return the index of the element, as a group of (score, item, index). Here is what I tried: #!/usr/bin/env python #…
Jaffer Wilson
2
votes
1 answer

How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

I am new to Python and I'm running fuzzywuzzy string matching logic on a list with 2 million records. The code works and produces output. The problem is that it is extremely slow: in 3 hours it processes only 80 rows. I want to…
Suyash P