
I have two large datasets that I have read into pandas DataFrames (~20K rows and ~40K rows, respectively). When I try merging these two DFs outright using pandas.merge on the address field, I get a paltry number of matches compared to the number of rows. So I thought I would try fuzzy string matching to see if it improves the number of output matches.

I approached this by trying to create a new column in DF1 (20K rows) that was the result of applying the fuzzywuzzy extractOne function to DF1[addressline] against DF2[addressline]. I quickly realized that this would take forever, since it would be doing close to 1 billion comparisons.

Both of these datasets have "County" fields, and my question is this: is there a way to conditionally fuzzy match the "addressline" fields in both DFs only where the "County" fields are the same? While researching questions similar to mine, I stumbled upon this discussion: Fuzzy logic on big datasets using Python

However, I am still fuzzy (no pun intended) on how to go about grouping/blocking fields by county. Any advice would be greatly appreciated!

import pandas as pd
from fuzzywuzzy import fuzz, process

def fuzzy_match(x, choices, scorer, cutoff):
    # returns the best match above the cutoff; note extractOne returns None
    # when nothing clears score_cutoff, so [0] will raise in that case
    return process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)[0]

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place', '345 Pizza Drive', '456 Pretzel Junction'], 'ID': ['X', 'U', 'X', 'Y']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'], 'ID': ['X', 'U', 'X', 'Y']})
test['Address1'] = test['Address1'].str.lower()
test2['Address1'] = test2['Address1'].str.lower()
test['FuzzyAddress1'] = test['Address1'].apply(fuzzy_match, args=(test2['Address1'], fuzz.ratio, 80))

I've added two images that show sample sets of the two different DFs imported into Excel. Not all the fields have been included, since they aren't important to my question. To reiterate my end goal: I want a new column in one of the DFs that holds the top result from fuzzy matching each address line against the address lines in the 2nd DF, but only for rows where the counties match between the two DFs. From there I plan to merge the two DFs on the fuzzy-matched address and the address line column in the 2nd DF. Hopefully this doesn't sound confusing.
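For reference, the "blocking" idea from the linked discussion can be sketched roughly as follows: build one candidate list per county from DF2 up front, then compare each DF1 row only against its own county's bucket. This is a minimal sketch, not the accepted approach below; it uses the standard-library difflib as a stand-in scorer (the same structure works with fuzzywuzzy's process.extractOne), and the County/Address1 column names and sample rows are assumptions for illustration.

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({'County': ['A', 'A', 'B'],
                    'Address1': ['123 cheese way', '234 cookie place', '345 pizza drive']})
df2 = pd.DataFrame({'County': ['A', 'A', 'B'],
                    'Address1': ['123 chese wy', '234 kookie pl', '345 pizzza dr']})

# Build one candidate list per county ("blocking"), so each row is only
# compared against addresses from the same county.
blocks = df2.groupby('County')['Address1'].apply(list).to_dict()

def best_match(row):
    candidates = blocks.get(row['County'], [])
    # difflib stands in for fuzzywuzzy here; cutoff plays the role of score_cutoff
    matches = difflib.get_close_matches(row['Address1'], candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None

df1['FuzzyAddress1'] = df1.apply(best_match, axis=1)
```

With 20K and 40K rows spread across many counties, this cuts the comparison count from every-row-times-every-row down to the sum of the per-county block sizes multiplied together, which is usually far smaller.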

Nirav
  • this is an interesting question but it took me a long time to read through and realise what you are asking. can you include code to create a runnable minimal example dataframe? it would only need 4 or 5 rows to illustrate what you're trying to do. – maxymoo Mar 17 '17 at 01:17
  • @maxymoo here is a sample as you requested: test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'ID':['X','U','X','Y']}) test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'ID':['X','U','X','Y']}) test['Address1'] = test['Address1'].apply(lambda x: x.lower()) test2['Address1'] = test2['Address1'].apply(lambda x: x.lower()) test['FuzzyAddress1'] = test['Address1'].apply(fuzzy_match, args = (test2['Address1'], fuzz.ratio, 80)) Here the groups will be 'ID' to fuzzy match on – Nirav Mar 17 '17 at 01:35
  • i've edited your question so that it just includes the relevant code for your question, it makes it easier on potential answerers to minimize the amount of unneeded code in your questions – maxymoo Mar 17 '17 at 04:17

1 Answer


You could adapt your fuzzy_match function to take the ID as a variable and use it to subset your choices before doing the fuzzy search. (Note that this requires applying the function over the whole dataframe rather than just the address column.)

def fuzzy_match(x, choices, scorer, cutoff):
    match = process.extractOne(x['Address1'], 
                               choices=choices.loc[choices['ID'] == x['ID'], 
                                                   'Address1'], 
                               scorer=scorer, 
                               score_cutoff=cutoff)
    if match:
        return match[0]

test['FuzzyAddress1'] = test.apply(fuzzy_match, 
                                   args=(test2, fuzz.ratio, 80), 
                                   axis=1)
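As a follow-up to the question's stated end goal, once FuzzyAddress1 has been filled in, the final join might look roughly like the sketch below. This is a sketch under assumptions: the small frames, the hypothetical Value column, and the suffixes choice are illustrative, not part of the answer above.

```python
import pandas as pd

# hypothetical frames standing in for test/test2 after fuzzy matching
test = pd.DataFrame({'Address1': ['123 cheese way', '234 cookie place'],
                     'ID': ['X', 'U'],
                     'FuzzyAddress1': ['123 chese wy', '234 kookie pl']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie pl'],
                      'ID': ['X', 'U'],
                      'Value': [10, 20]})

# left join: keep every row of `test`, pairing its fuzzy-matched address
# with test2's raw address; overlapping column names get a _df2 suffix
merged = test.merge(test2,
                    left_on='FuzzyAddress1',
                    right_on='Address1',
                    how='left',
                    suffixes=('', '_df2'))
```

Rows whose fuzzy match returned None simply get NaN in the right-hand columns, so the left frame is never shortened by the join.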
maxymoo
  • Thank you for your suggestion @maxymoo! Out of curiosity, would you happen to have any ideas how to approach the problem this way: "To reduce the number of comparisons, you can first group records that have some features in common, like the first five characters of an address field, or a common token. Then, only compare records that share a feature. This idea is called "blocking" and will usually reduce the number of total comparisons you have to make to something manageable." Taken from the above link I posted. I've been trying groupby but it hasn't been giving me the results I want – Nirav Mar 17 '17 at 20:39
  • Although after trying your suggestion, I am getting this error: KeyError: ('Address1', 'occurred at index Address1') – Nirav Mar 17 '17 at 20:48
  • @Nirav did you include `axis=1`? – maxymoo Mar 19 '17 at 04:13
  • yup I did. After making the comment regarding the error, I tried to find a solution on Google and managed to figure out that I needed to set axis=1. Thank you all the same for your help! – Nirav Mar 20 '17 at 17:35
  • @maxymoo Thanks for the answer. Any idea why I seem to be getting "TypeError: ("'NoneType' object is not subscriptable", 'occurred at index 20')"? If I remove the error index, I get the same error at a different index. I don't understand why these certain rows are causing a problem – Mitchell May 22 '18 at 15:50
  • @Mitchell it's because there was no match for those rows (based on the cutoff); i've edited my answer to put a check in for this, but you could also adjust your cutoff – maxymoo May 23 '18 at 03:31