How do to fuzzy matching on excel file using Pandas?

Question

I have a table called account with two columns - ID & NAME. ID is a hash which is unique but NAME is a string which might have duplicates.

I'm trying to write a python script to read this excel file and match 0-3 similar NAME values, but I just cannot seem to get it to work. Could someone help out? Thanks

import pandas as pd
from fuzzywuzzy import fuzz
import difflib

def get_spr(row):
    d = name1.apply(lambda x: (fuzz.ratio(x['NAME'], row['NAME']) * 0 if row['ID'] == x['ID'] else 1), axis=1)
    d = d[d>= 60]
    if len(d) == 0:
        v = ['']*2
    else:
        v = name1.ix[d.idxmax(),['ID' , 'NAME']].values
    return pd.Series(v, index=['ID', 'NAME'])


def score(tablerow):
    d = name1.apply(lambda x: fuzz.ratio(x['NAME'],tablerow['NAME']) * (0 if x['ID']==tablerow['ID'] else 1), axis=1)
    d = d[d>90]
    if len(d) == 0:
        v = [''] * 2
    else:
        v = name1.ix[d.order(ascending=False).head(3).index, ['ID' , 'NAME']].values
    return pd.DataFrame(v, index=['ID', 'NAME'])

account = "account_test.xlsx"

xl_acc1 = pd.ExcelFile(account)
xl_acc2 = pd.ExcelFile(account)

acc1 = xl_acc1.parse(xl_acc1.sheet_names[0])
acc2 = xl_acc2.parse(xl_acc2.sheet_names[0])

name1 = acc1[pd.notnull(acc1['NAME'])]
name2 = acc2[pd.notnull(acc2['NAME'])]
print 'Doing Fuzzy Matching'


name2= pd.concat((name2,name2.apply(get_spr, axis=1)), axis=1)
name2.to_excel(pd.ExcelWriter('res.xlsx'),'acc')

Any help would be much appreciated!

The file has rows like this:-

ID                    NAME
0016F00001c7GDZQA2  Daniela Abriani
0016F00001c7GPnQAM  Daniel Abriani
0016F00001c7JRrQAM  Nisha Well
0016F00001c7Jv8QAE  Katherine
0016F00001c7cXiQAI  Katerine
0016F00001c7dA3QAI  Katherin
0016F00001c7kHyQAI  Nursing and Midwifery Council Research Office
0016F00001c8G8OQAU  Nisa Well

Expected (output dataframe) would be something like:

      ID               NAME          ID2          NAME2
    <hash1>          katherine      <hash2>       katerine
    <hash1>          katherine      <hash3>       katherin
    <hash4>          Nisa Well      <hash5>       Nisha Well

Issue: The above code just reproduces the input as the output saved file without actually concatenating any matches.

what do you try to achieve by 'match' ? what is the expected result according to your example? — Colonel Beauvel, Mar 14 '16 at 08:27
Sorry for not specifying earlier. I've added the expected result. Is this possible? I need to save it to an excel file. — g0d, Mar 14 '16 at 08:34
No problem but how does the output relates to the input you gave? the data seem different. (why there are two colums ID and ID2 ? could it be more columns in some cases?) — Colonel Beauvel, Mar 14 '16 at 08:42
The input has two columns (ID and NAME). The output has 4 columns because it needs to have the similar records (similarity indicated by NAME) side by side. The actual value of the NAME doesn't matter in my example. My output shows how matching is done. I need a way to produce such an output using fuzzywuzzy library. — g0d, Mar 14 '16 at 09:06
Pleading to anyone to help out with the pandas specific portion of the code! — g0d, Mar 14 '16 at 11:18
more seriously, nobody wants to read this code to find out what's happening there. It is more usefull to express what you get as input, what you want as output and how you proceeded so far. — Colonel Beauvel, Mar 14 '16 at 11:21
Then can they not ignore the code and give suggestions or their approaches to similar problems? I've given the code to show "how you proceeded so far". What more do you want? — g0d, Mar 14 '16 at 13:26
Oh and the code is barely 30 lines. I would've thought that those interested in data analysis and willing to help out wouldn't have minded reading those few lines. I've gone above and beyond what is typically needed as a starting point for information when asking for help. I question, in fact, whether you've had any experience with pandas before. — g0d, Mar 14 '16 at 13:28

score 1 · Accepted Answer · answered Mar 14 '16 at 13:27

I don't think you need to do this in pandas. Here is my sloppy solution but it gets your desired output using a dictionary.

from fuzzywuzzy import process
df = pd.DataFrame([
                ['0016F00001c7GDZQA2',  'Daniela Abriani'],
                ['0016F00001c7GPnQAM',  'Daniel Abriani'],
                ['0016F00001c7JRrQAM',  'Nisha Well'],
                ['0016F00001c7Jv8QAE', 'Katherine'],
                ['0016F00001c7cXiQAI', 'Katerine'],
                ['0016F00001c7dA3QAI',  'Katherin'],
                ['0016F00001c7kHyQAI',  'Nursing and Midwifery Council Research Office'],
                ['0016F00001c8G8OQAU',  'Nisa Well']], 
                columns=['ID', 'NAME'])

get unique hashes in to a dictionary.

hashdict = dict(zip(df['ID'], df['NAME']))

define a function checkpair. You'll need it to remove reciprocal hash pairs. This method will add (hash1, hash2) and (hash2, hash1), but I think you only want to keep one of those pairs:

def checkpair (a,b,l):
    for x in l:
        if (a,b) == (x[2],x[0]):
            l.remove(x)

Now iterate through hashdict.items() finding the top 3 matches along the way. The fuzzywuzzy docs detail the process method.

matches = []
for k,v in hashdict.items():

    #see docs for extract -- 4 because you are comparing a name to itself
    top3 = process.extract(v, hashdict, limit=4)

    #remove the hashID compared to itself
    for h in top3:
        if k == h[2]:
            top3.remove(h)

    #append tuples to the list "matches" if it meets a score criteria      
    [matches.append((k, v, x[2], x[0], x[1])) for x in top3 if x[1] > 60] #change score?

    #remove reciprocal pairs
    [checkpair(m[0], m[2], matches) for m in matches]

df = pd.DataFrame(matches, columns=['id1', 'name1', 'id2', 'name2', 'score'])
# write to file
writer = pd.ExcelWriter('/path/to/your/file.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()

Output:

    id1     name1   id2     name2   score
0   0016F00001c7JRrQAM  Nisha Well  0016F00001c8G8OQAU  Nisa Well   95
1   0016F00001c7GPnQAM  Daniel Abriani  0016F00001c7GDZQA2  Daniela Abriani     97
2   0016F00001c7Jv8QAE  Katherine   0016F00001c7dA3QAI  Katherin    94
3   0016F00001c7Jv8QAE  Katherine   0016F00001c7cXiQAI  Katerine    94
4   0016F00001c7dA3QAI  Katherin    0016F00001c7cXiQAI  Katerine    88

Thanks so much for your response! I will try this later but in the mean time I have one question - Why do I need to add to hashdict? The IDs are hash values unique for each record but they're not actually used for anything other than identifying unique records. I hope this question comes out okay - I'm just trying to get clarity with this without trying to be offensive in any way. — g0d, Mar 14 '16 at 13:34
I keyed on the id because I assumed that you could have multiple IDs with the same name. If that's not the case then you could, of course, modify the code to be much cleaner. — Kevin, Mar 14 '16 at 13:40
Glad it helped! Let me know if I misunderstood the hash id, I can modify the code, although, it works "as-is". — Kevin, Mar 14 '16 at 17:18

How do to fuzzy matching on excel file using Pandas?

1 Answers1