
I've written a function called muzz that leverages the fuzzywuzzy module to 'merge' two pandas dataframes. It works well, but the performance is pretty bad on larger frames. Please take a look at the apply() that does the extracting/scoring and let me know if you have any ideas that could speed it up.

import pandas as pd
import numpy as np
import fuzzywuzzy as fw
import fuzzywuzzy.process  # import the submodule so fw.process.extractOne resolves

Create a frame of raw data

dfRaw = pd.DataFrame({'City': {0: u'St Louis',
                      1: 'Omaha',
                      2: 'Chicogo',
                      3: 'Kansas  city',
                      4: 'Des Moine'},
                      'State' : {0: 'MO', 1: 'NE', 2 : 'IL', 3 : 'MO', 4 : 'IA'}})

Which yields

          City  State
0     St Louis     MO
1        Omaha     NE
2      Chicogo     IL
3  Kansas city     MO
4    Des Moine     IA

Then a frame that represents the good data that we want to look up

dfLocations = pd.DataFrame({'City': {0: 'Saint Louis',
                          1: u'Omaha',
                          2: u'Chicago',
                          3: u'Kansas City',
                          4: u'Des Moines'},
                         'State' : {0: 'MO', 1: 'NE', 2 : 'IL', 
                                   3 : 'KS', 4 : 'IA'},
                          u'Zip': {0: '63201', 1: '68104', 2: '60290', 
                                   3: '68101', 4: '50301'}})

Which yields

          City  State    Zip
0  Saint Louis     MO  63201
1        Omaha     NE  68104
2      Chicago     IL  60290
3  Kansas City     KS  68101
4   Des Moines     IA  50301

and now the muzz function. EDIT: Added the choices = right[match_col_name] line and used choices in the apply, per BrenBarn's suggestion. I also, per BrenBarn's suggestion, ran some tests with extractOne() outside the apply and it appears to be the bottleneck. Maybe there's a faster way to do the fuzzy matching?

def muzz(left, right, on, match_col_name='match_on', score_col_name='score_match',
         right_suffix='_match', score_cutoff=80):

    # Build the lookup key by concatenating the 'on' columns, hoisted out of
    # the apply so it is only computed once
    right[match_col_name] = np.sum(right[on], axis=1)
    choices = right[match_col_name]

    ### The offending statement ###
    left[[match_col_name, score_col_name]] = (
        pd.Series(np.sum(left[on], axis=1)).apply(lambda x: pd.Series(
            fw.process.extractOne(x, choices, score_cutoff=score_cutoff))))

    dfTemp = pd.merge(left, right, how='left', on=match_col_name, suffixes=('', right_suffix))
    return dfTemp.drop(match_col_name, axis=1)
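
For context on the lambda: process.extractOne hands back a (match, score) pair when the best score clears score_cutoff and None otherwise, which is why the result gets wrapped in pd.Series and why the unmatched row comes out as NaN below. A minimal sketch against a plain list of the same concatenated City+State strings (just to show the return values, not part of muzz):

from fuzzywuzzy import process

# The concatenated lookup strings that muzz builds from dfLocations
choices = ['Saint LouisMO', 'OmahaNE', 'ChicagoIL', 'Kansas CityKS', 'Des MoinesIA']

process.extractOne('St LouisMO', choices, score_cutoff=85)       # ('Saint LouisMO', 87) in the run below
process.extractOne('Kansas  cityMO', choices, score_cutoff=85)   # None, nothing clears the cutoff, hence the NaN row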

Calling muzz

muzz(dfRaw.copy(), dfLocations, on=['City', 'State'], score_cutoff=85)

Which yields

    City        State   score_match City_match  State_match Zip
0   St Louis    MO      87          Saint Louis MO          63201
1   Omaha       NE      100         Omaha       NE          68104
2   Chicogo     IL      89          Chicago     IL          60290
3   Kansas city MO      NaN         NaN         NaN         NaN
4   Des Moine   IA      96          Des Moines  IA          50301
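
For reference, the standalone test mentioned in the EDIT was roughly along these lines (the choice list and repeat count here are just illustrative, not the actual test); timing extractOne on its own, without the apply, is what pointed at the matching itself as the bottleneck:

import timeit

# Time a single extractOne call against the concatenated lookup strings,
# bypassing the apply entirely
setup = (
    "from fuzzywuzzy import process\n"
    "choices = ['Saint LouisMO', 'OmahaNE', 'ChicagoIL', 'Kansas CityKS', 'Des MoinesIA']"
)
stmt = "process.extractOne('St LouisMO', choices, score_cutoff=85)"
print(timeit.timeit(stmt, setup=setup, number=1000))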
  • One possibility is to pull the `right[match_col_name]` outside of the `apply`, so that you don't recalculate it every time. You should try profiling your code and testing it with different inputs to see if the bottleneck is really in the apply or in the fuzzy matching itself. – BrenBarn Feb 09 '15 at 02:01
  • @BrenBarn I don't think I'm tracking you. I'm just passing a Series(right[match_col_name]) as the choices parameter for the extractOne() function – Bob Haffner Feb 09 '15 at 02:49
  • Yes, but that is inside the function you apply, which is called once for each element in the series you apply it on. Your code retrieves `right[match_col_name]` repeatedly, once for each element in the Series. If you extract this once to a variable and then use the variable in the call, you will avoid all those redundant lookups. – BrenBarn Feb 09 '15 at 02:55
  • Ok, gotcha. I'll give it a go – Bob Haffner Feb 09 '15 at 03:23
