Fuzzy Lookup In Python

Question

I have two CSV files. One that contains Vendor data and one that contains Employee data. Similar to what "Fuzzy Lookup" in excel does, I'm looking to do two types of matches and output all columns from both csv files, including a new column as the similarity ratio for each row. In excel, I would use a 0.80 threshold. The below is sample data and my actual data has 2 million rows in one of the files which is going to be a nightmare if done in excel.

Output 1: From Vendor file, fuzzy match "Vendor Name" with "Employee Name" from Employee file. Display all columns from both files and a new column for Similarity Ratio

Output 2: From Vendor file, fuzzy match "SSN" with "SSN" from Employee file. Display all columns from both files and a new column for Similarity Ratio

These are two separate outputs

Dataframe 1: Vendor Data

Company	Vendor ID	Vendor Name	Invoice Number	Transaction Amt	Vendor Type	SSN
15	58421	CLIFFORD BROWN	854	500	Misc	668419628
150	9675	GREEN	7412	70	One Time	774801971
200	15789	SMITH, JOHN	80	40	Employee	965214872
200	69997	HAROON, SIMAN	964	100	Misc	741-98-7821

Dataframe 2: Employee Data

Employee Name	Employee ID	Manager	SSN
BROWN, CLIFFORD	1	Manager 1	668-419-628
BLUE, CITY	2	Manager 2	874126487
SMITH, JOHN	3	Manager 3	965-21-4872
HAROON, SIMON	4	Manager 4	741-98-7820

Expected output 1 - Match Name

Employee Name	Employee ID	Manager	SSN	Company	Vendor ID	Vendor Name	Invoice Number	Transaction Amt	Vendor Type	SSN	Similarity Ratio
BROWN, CLIFFORD	1	Manager 1	668-419-628	150	58421	CLIFFORD BROWN	854	500	Misc	668419628	1.00
SMITH, JOHN	3	Manager 3	965-21-4872	200	15789	SMITH, JOHN	80	40	Employee	965214872	1.00
HAROON, SIMON	4	Manager 4	741-98-7820	200	69997	HAROON, SIMAN	964	100	Misc	741-98-7821	0.96
BLUE, CITY	2	Manager 2	874126487								0.00

Expected output 2 - Match SSN

Employee Name	Employee ID	Manager	SSN	Company	Vendor ID	Vendor Name	Invoice Number	Transaction Amt	Vendor Type	SSN	Similarity Ratio
BROWN, CLIFFORD	1	Manager 1	668-419-628	150	58421	CLIFFORD, BROWN	854	500	Misc	668419628	0.97
SMITH, JOHN	3	Manager 3	965-21-4872	200	15789	SMITH, JOHN	80	40	Employee	965214872	0.97
BLUE, CITY	2	Manager 2	874126487								0.00
HAROON, SIMON	4	Manager 4	741-98-7820								0.00

I've tried the below code:

import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Employee Data.xlsx')

matched_names = []

for row1 in df1.index:
    name1 = df1._get_value(row1, 'Vendor Name')  
    for row2 in df2.index:
        name2 = df2._get_value(row2, 'Full Name')  
        match = fuzz.ratio(name1, name2)
        if match > 80:  # This is the threshold
            match.append([name1, name2, match])

df_ratio = pd.DataFrame(columns=['Vendor Name', 'Employee Name','match'], data=matched_names)
df_ratio.to_csv(r'directory\MatchingResults.csv',  encoding='utf-8')

I'm just not getting the results I want and am ready to reinvent the whole script. Any suggestions would help to improve my script. Please note, I'm fairly new to Python so be gentle. I am totally open to a new approach on this example.

September 23 Update: Still having trouble...I'm able to get the similarity ratio now but not getting all the columns from both CSV files. The issue is that both files are completely different so when I concat, it gives NaN values. Any suggestions? New code below:

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Workday Data.xlsx')

df1['full_name']= df1['Vendor Name']
df2['full_name'] = df2['Employee Name']

df1_name = df1['full_name']
df2_name = df2['full_name']

frames = [pd.DataFrame(df1), pd.DataFrame(df2)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
dfresult = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes 
listOfDfs = [dfresult.loc[idx] for idx in np.split(dfresult.index, df.shape[0])]

DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(dfresult.shape[0])}

for name in DataFrameDict.keys():
    print(name)
    #print(DataFrameDict[name]

df = pd.DataFrame(list(DataFrameDict.items())).df.to_excel(r'Directory\TestOutput.xlsx', index = False)

ronpi · Accepted Answer · 2021-09-24T06:03:32.833

To concatenate the two DataFrames horizontally, I aligned the Employees DataFrame by the index of the matched Vendor Name. If no Vendor Name was matched, I just put an empty row instead.

In more details:

I iterated over the vendor names, and for each vendor name, I added the index of the employee name with the highest score to a list of indices. Note that I added at most one matched employee record to each vendor name.
If no match was found (too low score), I added the index of an empty record that I have added manually to the Employees Dataframe.
This list of indices is then used to reorder the Employees DataDrame.
at last, I just merge the two DataFrame horizontally. Note that the two DataFrames at this point doesn't have to be of the same size, but in such a case, the concat method just fill the gap with appending missing rows to the smaller DataFrame.

The code is as follows:


import numpy as np
import pandas as pd
from thefuzz import process as fuzzy_process    # the new repository of fuzzywuzzy

# import dataframes
...

# adding empty row
employees_df = employees_df.append(pd.Series(dtype=np.float64), ignore_index=True)
index_of_empty = len(employees_df) - 1

# matching between vendor and employee names
indexed_employee_names_dict = dict(enumerate(employees_df["Employee Name"]))
matched_employees = set()
ordered_employees = []
scores = []
for vendor_name in vendors_df["Vendor Name"]:
    match = fuzzy_process.extractOne(
        query=vendor_name,
        choices=indexed_employee_names_dict,
        score_cutoff=80
    )
    score, index = match[1:] if match is not None else (0.0, index_of_empty)
    matched_employees.add(index)
    ordered_employees.append(index)
    scores.append(score)

# detect unmatched employees to be positioned at the end of the dataframe
missing_employees = [i for i in range(len(employees_df)) if i not in matched_employees]
ordered_employees.extend(missing_employees)
ordered_employees_df = employees_df.iloc[ordered_employees].reset_index()

merged_df = pd.concat([vendors_df, ordered_employees_df], axis=1)
# adding the scores column and sorting by its values
scores.extend([0] * len(missing_employees))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
merged_df = merged_df.sort_values("Similarity Ratio", ascending=False)

For the matching according to the SSN columns, it can be done exactly in the same way, by just replacing the column names in the above code. Moreover, The process can be generalize to be a function that accepts DataFrames and column names:

def match_and_merge(df1: pd.DataFrame, df2: pd.DataFrame, col1: str, col2: str, cutoff: int = 80):
    # adding empty row
    df2 = df2.append(pd.Series(dtype=np.float64), ignore_index=True)
    index_of_empty = len(df2) - 1

    # matching between vendor and employee names
    indexed_strings_dict = dict(enumerate(df2[col2]))
    matched_indices = set()
    ordered_indices = []
    scores = []
    for s1 in df1[col1]:
        match = fuzzy_process.extractOne(
            query=s1,
            choices=indexed_strings_dict,
            score_cutoff=cutoff
        )
        score, index = match[1:] if match is not None else (0.0, index_of_empty)
        matched_indices.add(index)
        ordered_indices.append(index)
        scores.append(score)

    # detect unmatched employees to be positioned at the end of the dataframe
    missing_indices = [i for i in range(len(df2)) if i not in matched_indices]
    ordered_indices.extend(missing_indices)
    ordered_df2 = df2.iloc[ordered_indices].reset_index()

    # merge rows of dataframes
    merged_df = pd.concat([df1, ordered_df2], axis=1)

    # adding the scores column and sorting by its values
    scores.extend([0] * len(missing_indices))
    merged_df["Similarity Ratio"] = pd.Series(scores) / 100
    return merged_df.sort_values("Similarity Ratio", ascending=False)


if __name__ == "__main__":
    vendors_df = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
    employees_df = pd.read_excel(r'Directory\Sample Workday Data.xlsx')
    
    merged_df = match_and_merge(vendors_df, employees_df, "Vendor Name", "Employee Name")
    merged_df.to_excel("merged_by_names.xlsx", index=False)
    
    merged_df = match_and_merge(vendors_df, employees_df, "SSN", "SSN")
    merged_df.to_excel("merged_by_ssn.xlsx", index=False)

the above code is resulted with the following outputs:

merged_by_names.xlsx

Company	Vendor ID	Vendor Name	Invoice Number	Transaction Amt	Vendor Type	SSN	index	Employee Name	Employee ID	Manager	SSN	Similarity Ratio
200	15789	SMITH, JOHN	80	40	Employee	965214872	2	SMITH, JOHN	3	Manager 3	965-21-4872	1
15	58421	CLIFFORD BROWN	854	500	Misc	668419628	0	BROWN, CLIFFORD	1	Manager 1	668-419-628	0.95
200	69997	HAROON, SIMAN	964	100	Misc	741-98-7821	3	HAROON, SIMON	4	Manager 4	741-98-7820	0.92
150	9675	GREEN	7412	70	One Time	774801971	4	nan	nan	nan	nan	0
nan	nan	nan	nan	nan	nan	nan	1	BLUE, CITY	2	Manager 2	874126487	0

merged_by_ssn.xlsx

Company	Vendor ID	Vendor Name	Invoice Number	Transaction Amt	Vendor Type	SSN	index	Employee Name	Employee ID	Manager	SSN	Similarity Ratio
200	69997	HAROON, SIMAN	964	100	Misc	741-98-7821	3	HAROON, SIMON	4	Manager 4	741-98-7820	0.91
15	58421	CLIFFORD BROWN	854	500	Misc	668419628	0	BROWN, CLIFFORD	1	Manager 1	668-419-628	0.9
200	15789	SMITH, JOHN	80	40	Employee	965214872	2	SMITH, JOHN	3	Manager 3	965-21-4872	0.9
150	9675	GREEN	7412	70	One Time	774801971	4	nan	nan	nan	nan	0
nan	nan	nan	nan	nan	nan	nan	1	BLUE, CITY	2	Manager 2	874126487	0

Hi Ronpi. Im trying to run this now but I'm getting the error: "NameError: name 'fuzzy_process' is not defined". — Mystical Me, Sep 23 '21 at 19:23
Fixed, I just renamed the imported fuzzy.process and forgot to add it — ronpi, Sep 23 '21 at 19:30
No, you are not missing, I just got it also that the column is missing, and also the sorting - I added it now. — ronpi, Sep 23 '21 at 19:41
@Meowzz just let me know if you were able to tun it and get the same result — ronpi, Sep 23 '21 at 19:49
You are God to me. Yes this worked and I also tested it on my actual data with no issues. *hugs* — Mystical Me, Sep 23 '21 at 20:12
The `sep` argument is used in the method `DataFrame.to_csv` to determine the delimiter between values. I used it before, but then I replaced it to the `to_excel` method, but I forgot to remove this argument. I fixed it now — ronpi, Sep 24 '21 at 06:07

Fuzzy Lookup In Python

1 Answers1