Multiprocessing the Fuzzy match in pandas

Question

I have two data frames: Df_Address, which has 347k distinct addresses, and Df_Project, which has 24k records with the columns Project_Id, Project_Start_Date and Project_Address.

I want to check whether each Project_Address has a fuzzy match in Df_Address. If there is a match, I want to extract the Project_Id and Project_Start_Date for it. Below is the code I am trying:

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)

This code does provide an output in the form of a tuple

('matched_string', score)

However, it also returns strings that are merely similar. I also need to extract Project_Id and Project_Start_Date. Can someone help me achieve this using parallel processing, as the data is huge?

score 1 · Accepted Answer · answered Aug 13 '20 at 18:19

You can convert the tuples into a data frame and then join it back to your base data frame.

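import pandas as pd

# Base data frame, plus the (matched_string, score) tuples from the matching step
Df_Address = pd.DataFrame({'address': ['abc','cdf'],'random_stuff':[100,200]})
Matched = (('abc',10),('cdf',20))

# Turn the tuples into a data frame and name its columns
dist = pd.DataFrame(Matched)
dist.columns = ['address','distance']

# Left-join the scores onto the base data frame by address
final = Df_Address.merge(dist,how='left',on='address')
print(final)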

Output:

  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20

answered Aug 13 '20 at 18:19

SAL

You're not fuzzy matching. Pandas' merge does a direct comparison of strings; they must be identical. – Marcos Lima Sep 16 '21 at 19:18
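
One way to address the comment above, and the original request for parallel processing, is to keep the project columns attached to each candidate string during the fuzzy comparison, so no exact-match join is needed afterwards. The following is only a minimal sketch under several assumptions: it uses the standard library's difflib.SequenceMatcher as a stand-in for a fuzzywuzzy scorer (with fuzzywuzzy installed, something like fuzz.ratio could presumably be swapped in), it works on plain lists instead of data frames, and the names similarity, best_project, match_chunk and match_parallel, as well as the sample records, are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_project(address, projects, cutoff=80):
    """Return (address, Project_Id, Project_Start_Date, score) or None.

    `projects` is a list of (Project_Id, Project_Start_Date, Project_Address)
    tuples, so the id and date travel with the string being compared.
    """
    best, best_score = None, None
    for project_id, start_date, project_address in projects:
        score = similarity(address, project_address)
        if score >= cutoff and (best_score is None or score > best_score):
            best = (address, project_id, start_date, round(score, 1))
            best_score = score
    return best


def match_chunk(addresses, projects, cutoff=80):
    """Match one chunk of addresses; intended to run in a worker process."""
    results = []
    for address in addresses:
        hit = best_project(address, projects, cutoff)
        if hit is not None:
            results.append(hit)
    return results


def match_parallel(addresses, projects, cutoff=80, workers=4, chunk_size=1000):
    """Split the addresses into chunks and match each chunk in its own process."""
    chunks = [addresses[i:i + chunk_size]
              for i in range(0, len(addresses), chunk_size)]
    worker = partial(match_chunk, projects=projects, cutoff=cutoff)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [row for chunk in pool.map(worker, chunks) for row in chunk]


if __name__ == "__main__":
    # Made-up sample records, for illustration only
    projects = [
        ("P1", "2020-01-15", "12 Main Street, Springfield"),
        ("P2", "2020-03-01", "48 Oak Avenue, Riverton"),
    ]
    addresses = [
        "12 Main St, Springfield",
        "48 Oak Avenue Riverton",
        "900 Unrelated Road",
    ]
    for row in match_parallel(addresses, projects, workers=2, chunk_size=2):
        print(row)
```

With real data, the inputs could be built with something like list(Df_Address["Address"]) and list(zip(Df_Project["Project_Id"], Df_Project["Project_Start_Date"], Df_Project["Project_Address"])), and the returned rows loaded back into a data frame. Note that 347k addresses against 24k projects is roughly 8.3 billion comparisons, so spreading the work over processes only divides the runtime by about the number of cores. Reducing the candidate set first (for example, by grouping on a postcode or city, if the data has one) would likely matter more.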