I have two CSV files. One that contains Vendor data and one that contains Employee data. Similar to what "Fuzzy Lookup" in excel does, I'm looking to do two types of matches and output all columns from both csv files, including a new column as the similarity ratio for each row. In excel, I would use a 0.80 threshold. The below is sample data and my actual data has 2 million rows in one of the files which is going to be a nightmare if done in excel.
Output 1: From Vendor file, fuzzy match "Vendor Name" with "Employee Name" from Employee file. Display all columns from both files and a new column for Similarity Ratio
Output 2: From Vendor file, fuzzy match "SSN" with "SSN" from Employee file. Display all columns from both files and a new column for Similarity Ratio
These are two separate outputs
Dataframe 1: Vendor Data
Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN |
---|---|---|---|---|---|---|
15 | 58421 | CLIFFORD BROWN | 854 | 500 | Misc | 668419628 |
150 | 9675 | GREEN | 7412 | 70 | One Time | 774801971 |
200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 |
200 | 69997 | HAROON, SIMAN | 964 | 100 | Misc | 741-98-7821 |
Dataframe 2: Employee Data
Employee Name | Employee ID | Manager | SSN |
---|---|---|---|
BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 |
BLUE, CITY | 2 | Manager 2 | 874126487 |
SMITH, JOHN | 3 | Manager 3 | 965-21-4872 |
HAROON, SIMON | 4 | Manager 4 | 741-98-7820 |
Expected output 1 - Match Name
Employee Name | Employee ID | Manager | SSN | Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN | Similarity Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 | 150 | 58421 | CLIFFORD BROWN | 854 | 500 | Misc | 668419628 | 1.00 |
SMITH, JOHN | 3 | Manager 3 | 965-21-4872 | 200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 | 1.00 |
HAROON, SIMON | 4 | Manager 4 | 741-98-7820 | 200 | 69997 | HAROON, SIMAN | 964 | 100 | Misc | 741-98-7821 | 0.96 |
BLUE, CITY | 2 | Manager 2 | 874126487 | 0.00 |
Expected output 2 - Match SSN
Employee Name | Employee ID | Manager | SSN | Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN | Similarity Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 | 150 | 58421 | CLIFFORD, BROWN | 854 | 500 | Misc | 668419628 | 0.97 |
SMITH, JOHN | 3 | Manager 3 | 965-21-4872 | 200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 | 0.97 |
BLUE, CITY | 2 | Manager 2 | 874126487 | 0.00 | |||||||
HAROON, SIMON | 4 | Manager 4 | 741-98-7820 | 0.00 |
I've tried the below code:
import pandas as pd
from fuzzywuzzy import fuzz
df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Employee Data.xlsx')
matched_names = []
for row1 in df1.index:
name1 = df1._get_value(row1, 'Vendor Name')
for row2 in df2.index:
name2 = df2._get_value(row2, 'Full Name')
match = fuzz.ratio(name1, name2)
if match > 80: # This is the threshold
match.append([name1, name2, match])
df_ratio = pd.DataFrame(columns=['Vendor Name', 'Employee Name','match'], data=matched_names)
df_ratio.to_csv(r'directory\MatchingResults.csv', encoding='utf-8')
I'm just not getting the results I want and am ready to reinvent the whole script. Any suggestions would help to improve my script. Please note, I'm fairly new to Python so be gentle. I am totally open to a new approach on this example.
September 23 Update: Still having trouble...I'm able to get the similarity ratio now but not getting all the columns from both CSV files. The issue is that both files are completely different so when I concat, it gives NaN values. Any suggestions? New code below:
import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Workday Data.xlsx')
df1['full_name']= df1['Vendor Name']
df2['full_name'] = df2['Employee Name']
df1_name = df1['full_name']
df2_name = df2['full_name']
frames = [pd.DataFrame(df1), pd.DataFrame(df2)]
df = pd.concat(frames).reset_index(drop=True)
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
dfresult = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())
#create of list of dataframes
listOfDfs = [dfresult.loc[idx] for idx in np.split(dfresult.index, df.shape[0])]
DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(dfresult.shape[0])}
for name in DataFrameDict.keys():
print(name)
#print(DataFrameDict[name]
df = pd.DataFrame(list(DataFrameDict.items())).df.to_excel(r'Directory\TestOutput.xlsx', index = False)