I have two data frames: Df_Address, which has 347k distinct addresses, and Df_Project, which has 24k records with the columns Project_Id, Project_Start_Date and Project_Address.
I want to check whether each Project_Address has a fuzzy match in Df_Address. If there is a match, I want to extract the corresponding Project_Id and Project_Start_Date. Below is the code I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    # Return the best match for x among choices, or None if no score reaches cutoff.
    print(x)  # progress/debug output
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )

# Match every address against all project addresses with a cutoff score of 80.
Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)
This code does produce output in the form of a tuple:
('matched_string', score)
However, it also returns strings that are merely similar. I also need to extract Project_Id and Project_Start_Date for each match. Can someone help me achieve this using parallel processing, as the data is huge?
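To make the goal concrete, here is a minimal sketch of the kind of result I am after. It is only an illustration under assumptions: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for a fuzzywuzzy scorer (so the scores will not be identical), it works on plain lists of dicts instead of data frames, and the helper names `score`, `best_match` and `match_all` are hypothetical. The idea is to carry the whole project row through the match, so Project_Id and Project_Start_Date come along with the matched address, and to spread the addresses over worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def score(a, b):
    """Similarity on a 0-100 scale (stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(address, projects, cutoff):
    """Return the best-scoring project row for one address, or None."""
    best, best_score = None, -1.0
    for project in projects:
        s = score(address, project["Project_Address"])
        if s > best_score:
            best, best_score = project, s
    if best is None or best_score < cutoff:
        return None
    # Keep the project columns next to the address that matched them.
    return {
        "Address": address,
        "Matched_Project_Address": best["Project_Address"],
        "Score": round(best_score),
        "Project_Id": best["Project_Id"],
        "Project_Start_Date": best["Project_Start_Date"],
    }


def match_all(addresses, projects, cutoff=80, workers=1):
    """Match every address; use a process pool when workers > 1."""
    func = partial(best_match, projects=projects, cutoff=cutoff)
    if workers <= 1:
        return [r for r in map(func, addresses) if r is not None]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches addresses so each task carries many rows.
        results = pool.map(func, addresses, chunksize=1000)
        return [r for r in results if r is not None]
```

With the real data, the inputs could presumably be built with `Df_Project.to_dict("records")` and `Df_Address["Address"].tolist()`, and the result turned back into a data frame with `pd.DataFrame(match_all(addresses, projects, cutoff=80, workers=4))`. On platforms that start worker processes by spawning, that call would need to sit under an `if __name__ == "__main__":` guard. Even in parallel, 347k addresses against 24k projects is roughly 8.3 billion comparisons, so some way of reducing the candidate pairs first would probably still be needed.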