OBJECTIVE
Take Company B's Accounting Description (e.g "Cash") and match them to Company A's accounting description (e.g "Cash Rollup").
APPROACH
- Record Company A and Company B's Accounting Descriptions, place each into their own dataframes (assign Company A's Accounting Descriptions as "approved_accts"). Make Company A's Accounting descriptions =
approved_accts
- Using python's fuzzywuzzy package, compare Company B's Accounting Descriptions with the
approved_accts
. If the fuzzywuzzy match is above a certain threshold, append the "approve_accts" label to Company B's Accounting Descriptions (preferably in an adjacent column so Company B's original description isn't lost).
CODE
import pandas as pd
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
fileLoc = r'...\mapping.xlsx'
xls = pd.ExcelFile(fileLoc)
df_plp = xls.parse('PLP')
df_nbg = xls.parse('NBG')
df_plp.head(10)
#Confirmed the plp.head layout matches the excel file
df_nbg.head(10)
#Confirmed the nbg.head layout matches the excel file
#pull out the Account Descriptions for comparison
approved_accts = df_nbg['Account Description']
plp_accts = df_plp['Account Description']
#Test fuzzywuzzy
process.extract('Cash', 'Cash Cash Rollup')
OUTPUT
process.extract('Cash', 'Cash Cash Rollup')
[('C', 90), ('a', 90), ('s', 90), ('h', 90), ('C', 90)]
process.extractOne('Cash', 'Cash Cash Rollup')
('C', 90)
fuzz.token_set_ratio('Cash', 'Cash Cash Rollup')
100
QUESTIONS
In output #1, 2 - it appears that FuzzyWuzzy is breaking down each individual character for comparison (which is incorrect). I've looked online and found the following github issue. However, just yesterday this implementation was working with another example. I was wondering, what could the cause of this be?
For each Account Description, there is a unique ID ("Account Number"). Is there a way that I can carry this unique ID forward such that Company B's "Cash" account (Account ID: B123) is matched with Company A's approved list "Cash Cash Rollup" (Account ID: A321) so that I can see which account numbers need to be remapped? Essentially, these Account Numbers are just "along for the ride" when we are comparing each Company's account description, but will become valuable when I dive into Accounting software/ERP systems.
Are there any other libraries that you recommend using (instead of fuzzywuzzy) to assist in the process?