
First, note that I'm a Python newbie, so apologies in advance. I have, however, researched this for the last day or two with no luck, hence my first post here.

I need to fuzzy match data based on 'Name' in a CSV file in the following format:

Code,Name,Location
123,Test data,LON
456,Data test,LON
789,Other,LON
1234,Test data,NYC

The problem I'm having, however, is that I want fuzzywuzzy to only look at rows with the same location code as the row in the current iteration. So on my first loop, 'Test data, LON' should not match 'Test data, NYC'.

This is what I have so far:

import pandas as pd
import numpy as np
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

data = pd.read_csv('data.csv', delimiter=',', usecols=['Code', 'Name', 'Location'])

for index, row in data.iterrows():
    location = row['Location']
    name = row['Name']
    # Select the Name column as a Series (not a DataFrame) so extractBests
    # scores the names themselves and keeps each row's index in the result.
    dd = data[data.Location == location]['Name']
    result = process.extractBests(name, dd, limit=3)
    print(result)    

The idea behind the above is to loop through my DataFrame, extract each row's location, and use that as a filter to build a subset of data for fuzzywuzzy to match against.

Any help, or a nudge in the right direction, would be greatly appreciated. Thanks.

Edit

I'd like the match output as follows, I can then look into laying this out as I see fit:

('Test data', [('Test data', 100, 0), ('Test data', 100, 3), ('Data test', 95, 1), ('Other', 34, 2)])
('Data test', [('Data test', 100, 1), ('Test data', 95, 0), ('Test data', 95, 3), ('Other', 36, 2)])

This output should only contain matches from the same Location, though.

As further context, I have 110k rows of data with variations in the Name column, and I'd like to find those variations. I only care about matches within the same Location, so I don't think it's necessary to fuzzy-match against the whole 110k-row dataset.
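To make the grouping requirement concrete, here is a tiny standard-library sketch of the "match only within the same Location" behaviour I'm after (using `difflib` purely as a stand-in scorer, since the scoring itself isn't my problem):

```python
from collections import defaultdict
from difflib import SequenceMatcher

# The sample data from the question, as (Code, Name, Location) tuples.
rows = [
    ('123', 'Test data', 'LON'),
    ('456', 'Data test', 'LON'),
    ('789', 'Other', 'LON'),
    ('1234', 'Test data', 'NYC'),
]

# Bucket row indices and names by location first,
# so matching can never cross locations.
by_location = defaultdict(list)
for idx, (code, name, loc) in enumerate(rows):
    by_location[loc].append((idx, name))

results = []
for loc, names in by_location.items():
    for idx, name in names:
        # Score this name against every name in the same location only,
        # keeping the top 3 as (name, score, row_index) tuples.
        matches = sorted(
            ((other, round(SequenceMatcher(None, name, other).ratio() * 100), j)
             for j, other in names),
            key=lambda t: t[1],
            reverse=True,
        )[:3]
        results.append((loc, name, matches))
        print((name, matches))
```

With this shape, 'Test data' in LON can only ever be compared against rows 0–2, never against the NYC copy at row 3.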

Jack
  • You might want to look into [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) - grouping by location and then fuzzy-searching the `Name` field – Josh Friedlander Feb 20 '19 at 14:30
  • From your example given what matching would you like to achieve (i.e. do you want all the London's grouped or just the top 2 London's) – Dillon Feb 20 '19 at 14:48
  • Thanks for the replies, Josh and Dillon. Josh, I'll check out groupby, thanks. Dillon, I've edited my post with a few additional details. – Jack Feb 20 '19 at 15:05

1 Answer


How about turning your "Location" column into a de-duplicated list and iterating through that:

import pandas as pd
import numpy as np
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

data = pd.read_csv('data.csv')
locations = list(data['Location'].drop_duplicates())
for loc in locations:
    # Restrict matching to rows from a single location.
    datafiltered = data[data['Location'] == loc]
    for name in datafiltered['Name']:
        # Passing the Name Series keeps the row index in each result tuple.
        result = process.extractBests(name, datafiltered['Name'], limit=3)
        print(result)

I hope it helps. BR

RenauV