First, note that I'm a Python newbie, so apologies in advance. I have, however, researched this for the last day or two with no luck, hence my first post here.
I need to fuzzy match data based on 'Name' in a CSV file in the following format:
Code,Name,Location
123,Test data,LON
456,Data test,LON
789,Other,LON
1234,Test data,NYC
The problem I'm having, however, is that I want fuzzywuzzy
to only compare names that share the Location code of the row in the current iteration.
So on my first loop, 'Test data, LON' should not match 'Test data, NYC'.
This is what I have so far:
import pandas as pd
from fuzzywuzzy import process

data = pd.read_csv('data.csv', usecols=['Code', 'Name', 'Location'])

for index, row in data.iterrows():
    location = row['Location']
    name = row['Name']
    # Select the Name column as a Series (dict-like), so extractBests
    # returns (match, score, index) tuples rather than misreading a DataFrame
    dd = data.loc[data.Location == location, 'Name']
    result = process.extractBests(name, dd, limit=3)
    print(result)
The idea behind the above is to loop through my DataFrame, extract each row's Location, and use it as a filter to build a subset of the data for fuzzywuzzy to match against.
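To illustrate the filter-then-match pattern I'm after with inline data (using the stdlib's difflib.SequenceMatcher as a stand-in scorer, since I can't assume fuzzywuzzy is available everywhere; its scores will differ from fuzz.ratio's):

```python
import difflib

import pandas as pd

# Inline sample mirroring data.csv; in my real code this comes from read_csv.
data = pd.DataFrame({
    'Code': [123, 456, 789, 1234],
    'Name': ['Test data', 'Data test', 'Other', 'Test data'],
    'Location': ['LON', 'LON', 'LON', 'NYC'],
})

def score(a, b):
    # Stand-in for fuzzywuzzy's fuzz.ratio: 0-100 similarity via difflib.
    return int(round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()))

for index, row in data.iterrows():
    # Only consider names from rows sharing this row's Location.
    candidates = data.loc[data.Location == row['Location'], 'Name']
    matches = sorted(
        ((name, score(row['Name'], name), i) for i, name in candidates.items()),
        key=lambda m: m[1],
        reverse=True,
    )[:3]
    print(row['Name'], matches)
```

With this, 'Test data' in NYC only ever sees the single NYC row as a candidate, which is the behaviour I want.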
Any help, or a nudge in the right direction, would be greatly appreciated. Thanks.
Edit
I'd like the match output in the following format; I can then look into laying it out as I see fit:
('Test data', [('Test data', 100, 0), ('Test data', 100, 3), ('Data test', 95, 1), ('Other', 34, 2)])
('Data test', [('Data test', 100, 1), ('Test data', 95, 0), ('Test data', 95, 3), ('Other', 36, 2)])
This output should only contain matches from the same Location, though.
As further context, I have 110k rows of data with variations in the Name column that I'd like to find. I only care about matches within the same Location, so I see no need to fuzzy match against the whole 110k-row dataset.
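One direction I'm considering for the 110k rows is grouping by Location once up front, rather than re-filtering the full frame on every iteration. A sketch of that idea (again with stdlib difflib standing in for fuzzywuzzy's scorer):

```python
import difflib

import pandas as pd

# Inline sample mirroring data.csv.
data = pd.DataFrame({
    'Code': [123, 456, 789, 1234],
    'Name': ['Test data', 'Data test', 'Other', 'Test data'],
    'Location': ['LON', 'LON', 'LON', 'NYC'],
})

def score(a, b):
    # Stand-in for fuzzywuzzy's fuzz.ratio: 0-100 similarity via difflib.
    return int(round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()))

results = {}
# Group once by Location so each name is only compared within its own group,
# instead of re-filtering the full 110k-row frame once per row.
for location, group in data.groupby('Location'):
    names = group['Name']
    for i, name in names.items():
        scored = sorted(
            ((other, score(name, other), j) for j, other in names.items()),
            key=lambda m: m[1],
            reverse=True,
        )[:3]
        results[i] = (name, scored)
```

Each `results` entry keeps the (match, score, original index) shape from my desired output, and candidates can never leak across Locations.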