0

OBJECTIVE

Take Company B's Accounting Description (e.g "Cash") and match them to Company A's accounting description (e.g "Cash Rollup").

APPROACH

  1. Record Company A and Company B's Accounting Descriptions, place each into their own dataframes (assign Company A's Accounting Descriptions as "approved_accts"). Make Company A's Accounting descriptions = approved_accts
  2. Using python's fuzzywuzzy package, compare Company B's Accounting Descriptions with the approved_accts. If the fuzzywuzzy match is above a certain threshold, append the "approve_accts" label to Company B's Accounting Descriptions (preferably in an adjacent column so Company B's original description isn't lost).

CODE

import pandas as pd
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

fileLoc = r'...\mapping.xlsx'
xls = pd.ExcelFile(fileLoc)

df_plp = xls.parse('PLP')
df_nbg = xls.parse('NBG')

df_plp.head(10)
#Confirmed the plp.head layout matches the excel file

df_nbg.head(10)
#Confirmed the nbg.head layout matches the excel file

#pull out the Account Descriptions for comparison
approved_accts = df_nbg['Account Description']
plp_accts = df_plp['Account Description']

#Test fuzzywuzzy
process.extract('Cash', 'Cash Cash Rollup')

OUTPUT

process.extract('Cash', 'Cash Cash Rollup')

[('C', 90), ('a', 90), ('s', 90), ('h', 90), ('C', 90)]

process.extractOne('Cash', 'Cash Cash Rollup')

('C', 90)

fuzz.token_set_ratio('Cash', 'Cash Cash Rollup')

100

QUESTIONS

  1. In output #1, 2 - it appears that FuzzyWuzzy is breaking down each individual character for comparison (which is incorrect). I've looked online and found the following github issue. However, just yesterday this implementation was working with another example. I was wondering, what could the cause of this be?

  2. For each Account Description, there is a unique ID ("Account Number"). Is there a way that I can carry this unique ID forward such that Company B's "Cash" account (Account ID: B123) is matched with Company A's approved list "Cash Cash Rollup" (Account ID: A321) so that I can see which account numbers need to be remapped? Essentially, these Account Numbers are just "along for the ride" when we are comparing each Company's account description, but will become valuable when I dive into Accounting software/ERP systems.

  3. Are there any other libraries that you recommend using (instead of fuzzywuzzy) to assist in the process?

jonplaca
  • 797
  • 4
  • 16
  • 34

2 Answers2

0

RE 1: It seems like FuzzyWuzzy's process functions expect a list as the second argument, so perhaps it's parsing your string as a list of characters? See fuzzy wuzzy documentation. It probably also makes sense to use the pandas series/df .unique() method to avoid having to deal with possible duplicates.

RE 2: If you have a table with columns A and B and another table with columns C and D and you have a mapping from A to C (from fuzzy string matching for example), you can get a mapping from B to D by joining the two data frames on A and C and then looking for all unique pairs of B and D. Something like this: joined_df[['B', 'D']].unique().

RE 3: I've personally used FuzzyWuzzy to join two separate data frames on a fuzzy string match, and short of writing your own library to accomplish the same task I think it's probably the best approach there is. It's fast and relatively easy to define a function that takes a series and uses FuzzyWuzzy's ExtractOne to select and return the closest match.

Petergavinkin
  • 150
  • 11
0

Is you check the documentation for process.extract https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py#L132

you need to pass in an iterable (a list or a dictionary)

Below works:

process.extract('Cash', 'Cash Cash Rollup'.split(' '))

[('Cash', 100), ('Cash', 100), ('Rollup', 0)]

Aks
  • 932
  • 2
  • 17
  • 32