Fuzzy duplicate matching with pandas

Question

I have a DataFrame containing two columns of string data, 'NameTest' and 'Name'. I want to compare each name in 'NameTest' against every name in 'Name', and if the closest match is more than 80% similar, print that closest matching name.

My DataFrame

	NameTest	Name
0	john carry	john carrt
1	alex midlane	john crat
2	robert patt	alex mid
3	david baker	alex
4	NaN	patt
5	NaN	robert
6	NaN	david baker

My Code

from fuzzywuzzy import fuzz, process
import pandas as pd
import numpy as np
import difflib
cols = ["Name", "NameTest"]
df = pd.read_excel(
    r'D:\FFOutput\name.xlsx', usecols=cols,)  # Read Excel



for i, row in df.iterrows():
    na = row.Name
    ne = row.NameTest
    print([ne, na])
    for i in na:
        c = difflib.SequenceMatcher(isjunk=None, a=ne, b=na)
        diff = c.ratio()*100
        diff = round(diff, 1)
    if diff >= 80:
        print(na, diff)

Any suggestions?

Thank you for your help

score 0 · Accepted Answer · answered Feb 17 '21 at 10:27

For this purpose FuzzyWuzzy provides process.extractOne, which searches for the best match above a score threshold. Searching through the Names len(df) times requires len(df) * len(df) comparisons (assuming no elements are np.nan), which can become very time-consuming for bigger tables. That's why I am going to use RapidFuzz (I am the author) in my answer, which is a lot faster. You can, however, simply replace the import statement with fuzzywuzzy if performance is not relevant for the task.

You could rewrite your code in the following way:

import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df = pd.DataFrame({
"NameTest": ["john carry", "alex midlane", "robert patt", "david baker", np.nan, np.nan, np.nan],
"Name": ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]
})

# filter out non-strings, since they are not supported by rapidfuzz/fuzzywuzzy/difflib
Names = [name for name in df["Name"] if isinstance(name, str)]

for NameTest in df["NameTest"]:
  if isinstance(NameTest, str):
    match = process.extractOne(
      NameTest, Names,
      scorer=fuzz.ratio,
      processor=None,
      score_cutoff=80)

    if match:
      print(match[0], match[1])

which prints:

john carrt 90.0
alex mid 80.0
david baker 100.0

I really want to thank you for your help. – kittithep chantapatchanee Feb 17 '21 at 11:33
Hello, how can I convert the printed output to a DataFrame? Thanks – kittithep chantapatchanee Mar 26 '21 at 13:51
