0

So I have a CSV file full of transactions, with the vendor name in one column and the transaction amount in another. The goal is to find the top vendors in terms of total number of transactions. That part is pretty simple, and I have code like so:

with open('Transactions.csv') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}
    next(file_reader)
    for row in file_reader:
        if row[3] not in vendor_dict:
            vendor_dict[row[3]] = [0, 0]
            vendor_dict[row[3]][1] += round(float(row[1]), 2)
        else:
            vendor_dict[row[3]][0] += 1
            vendor_dict[row[3]][1] += round(float(row[1]), 2)

The problem is, there are many entries where the same vendor is spelled slightly different ("Delta Airlines" v. "Delta Air"). What is the best way to detect these similar string names (say, with Fuzzywuzzy) while looping through the CSV file, and consolidate the transaction instance and amounts?

2 Answers2

0
import csv

from fuzzywuzzy import fuzz

with open('Transactions.csv') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}
    next(file_reader)  # skipping a header?
    for row in file_reader:

        # we can't use the dictionary directly (e.g. "key in vendor_dict")
        # because we want to do a similarity search.
        csv_name = row[3]
        for vendor_name, vendor_values in vendor_dict.iteritems():

            # this is *a* way to do it. You may want to use different scores
            # or even a different comparison
            if fuzz.token_set_ratio(csv_name, vendor_name) > 80:
                vendor_values[0] += 1
                vendor_values[1] += round(float(row[1]), 2)
                break
        else:
            # we didn't find anything similar enough, so create an entry
            vendor_values = [0, 0]
            vendor_values[1] += round(float(row[1]), 2)

        vendor_dict[csv_name] = vendor_values
Chris Curvey
  • 9,738
  • 10
  • 48
  • 70
  • I think the fuzzywuzzy part of this works, but the end result is an empty dictionary. I wonder if it's because when you assign values to vendor_name, vendor_values from vendor_dict.iteritems() (which in Python 3.x is just vendor_dict.items(), right?) it isn't actually reflecting back onto the original vendor_dict dictionary? – OnsideBump Jul 06 '19 at 21:16
  • it's because my indentation is wrong on the last line. Let me fix – Chris Curvey Jul 07 '19 at 00:03
0

Read the csv file in pandas. Then add a new column for the fuzzywuzzy percent match.

Create a threshold as to which percentage shall be considered same string then do the calculation afterwards by filtering using isin() method then adding the values of the columns for the transaction amount.

Loop this to your entire DataFrame and you'll get the result you need.

Joe
  • 879
  • 2
  • 6
  • 15