Create a Fuzzy Duplicate Key to Sum Rows with Fuzzy Matches (Pandas)

Question

So I have a table where I have identified fuzzy matches and an amount. I want to be able to summarize the amount by this common key.

My Data looks like this:

Name	Match1	Match2	Amount
Jame	James	Jim	1
Jame	James	Jim	2
Mike	Mikes	Miike	3
James	Jame	Jim	4
Jim	Jame	James	5
Mikes	Mike	Miike	6
Miike	Mike	Mikes	7

data = {'Name': {0: 'Jame',
  1: 'Jame',
  2: 'Mike',
  3: 'James',
  4: 'Jim',
  5: 'Mikes',
  6: 'Miike'},
 'Match1': {0: 'James',
  1: 'James',
  2: 'Mikes',
  3: 'Jame',
  4: 'Jame',
  5: 'Mike',
  6: 'Mike'},
 'Match2': {0: 'Jim',
  1: 'Jim',
  2: 'Miike',
  3: 'Jim',
  4: 'James',
  5: 'Miike',
  6: 'Mikes'},
 'Amount': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}

df = pd.DataFrame.from_dict(data)

What I'd like my resulting table to look like:

Key	Name	Match1	Match2	Amount
Jame	Jame	James	Jim	1
Jame	Jame	James	Jim	2
Mike	Mike	Mikes	Miike	3
Jame	James	Jame	Jim	4
Jame	Jim	Jame	James	5
Mike	Mikes	Mike	Miike	6
Mike	Miike	Mike	Mikes	7

So I can summarize my data like this:

Name	Amount
Jame	12
Mike	16

So the `Key` will match either the `name`, the `match1` or the `match2`? — , Aug 22 '22 at 14:28

score 1 · Accepted Answer · 2022-08-23T00:29:31.923

1

If you initially have the list of keys for your dataframe, you can do this.

mylist = ['Jame','Mike']

df['Key'] = [j for i in df.to_numpy() for j in mylist if j in i]

df.set_index(df.columns[-1]).reset_index(inplace=True)

    Key   Name Match1 Match2  Amount
0  Jame   Jame  James    Jim       1
1  Jame   Jame  James    Jim       2
2  Mike   Mike  Mikes  Miike       3
3  Jame  James   Jame    Jim       4
4  Jame    Jim   Jame  James       5
5  Mike  Mikes   Mike  Miike       6
6  Mike  Miike   Mike  Mikes       7

And then, you can do groupby.sum to get your desired output.

df.groupby('Key')['Amount'].sum().reset_index()
    Key  Amount
0  Jame      12
1  Mike      16

Update

To obtain the key through the dataframe, you can use the process.extractOne in fuzzywuzzy to compare the matching percentage of every element in a row. If it's less than the threshold, we will update the key.

from fuzzywuzzy import process

key = [df.Name[0]]

for row in df.to_numpy():
    for i in row:
        if 50 < process.extractOne(i, key)[1]:
            break
        else:
            key.append(row[0])
            break
            

key
Out[65]: ['Jame', 'Mike']

edited Aug 23 '22 at 00:29

answered Aug 22 '22 at 14:47

I don't have a list of keys. I think the logic I would like is that it would take the first name where the matches take place and all subsequent items were all the items match it would take the first 'Key'. – David 54321 Aug 22 '22 at 18:06
IIUC, for Jame, the first time it appears and has the matches is on index 0, whereby Mike is on index 2. So we use these two names as the keys – Aug 22 '22 at 19:29
Yes, but I want the code to define the keys not myself – David 54321 Aug 22 '22 at 20:01
1

hey @David54321 You can find the key using this code ```key = set() for row in df.to_numpy(): if not key: key.add(row[0]) found = False for k in key: if k in row: found = True break if not found : key.add(row[0])``` – Noman Aug 22 '22 at 20:52
See if it works for you now? – Aug 23 '22 at 00:29

Create a Fuzzy Duplicate Key to Sum Rows with Fuzzy Matches (Pandas)

What I'd like my resulting table to look like:

1 Answers1

Update