3

So I have a table where I have identified fuzzy matches and an amount. I want to be able to summarize the amount by this common key.

My Data looks like this:

Name Match1 Match2 Amount
Jame James Jim 1
Jame James Jim 2
Mike Mikes Miike 3
James Jame Jim 4
Jim Jame James 5
Mikes Mike Miike 6
Miike Mike Mikes 7
data = {'Name': {0: 'Jame',
  1: 'Jame',
  2: 'Mike',
  3: 'James',
  4: 'Jim',
  5: 'Mikes',
  6: 'Miike'},
 'Match1': {0: 'James',
  1: 'James',
  2: 'Mikes',
  3: 'Jame',
  4: 'Jame',
  5: 'Mike',
  6: 'Mike'},
 'Match2': {0: 'Jim',
  1: 'Jim',
  2: 'Miike',
  3: 'Jim',
  4: 'James',
  5: 'Miike',
  6: 'Mikes'},
 'Amount': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}

df = pd.DataFrame.from_dict(data)

What I'd like my resulting table to look like:

Key Name Match1 Match2 Amount
Jame Jame James Jim 1
Jame Jame James Jim 2
Mike Mike Mikes Miike 3
Jame James Jame Jim 4
Jame Jim Jame James 5
Mike Mikes Mike Miike 6
Mike Miike Mike Mikes 7

So I can summarize my data like this:

Name Amount
Jame 12
Mike 16
Galaxy
  • 172
  • 9
David 54321
  • 568
  • 1
  • 9
  • 23

1 Answers1

1

If you initially have the list of keys for your dataframe, you can do this.

mylist = ['Jame','Mike']

df['Key'] = [j for i in df.to_numpy() for j in mylist if j in i]

df.set_index(df.columns[-1]).reset_index(inplace=True)

    Key   Name Match1 Match2  Amount
0  Jame   Jame  James    Jim       1
1  Jame   Jame  James    Jim       2
2  Mike   Mike  Mikes  Miike       3
3  Jame  James   Jame    Jim       4
4  Jame    Jim   Jame  James       5
5  Mike  Mikes   Mike  Miike       6
6  Mike  Miike   Mike  Mikes       7

And then, you can do groupby.sum to get your desired output.

df.groupby('Key')['Amount'].sum().reset_index()
    Key  Amount
0  Jame      12
1  Mike      16

Update

To obtain the key through the dataframe, you can use the process.extractOne in fuzzywuzzy to compare the matching percentage of every element in a row. If it's less than the threshold, we will update the key.

from fuzzywuzzy import process

key = [df.Name[0]]

for row in df.to_numpy():
    for i in row:
        if 50 < process.extractOne(i, key)[1]:
            break
        else:
            key.append(row[0])
            break
            

key
Out[65]: ['Jame', 'Mike']
  • I don't have a list of keys. I think the logic I would like is that it would take the first name where the matches take place and all subsequent items were all the items match it would take the first 'Key'. – David 54321 Aug 22 '22 at 18:06
  • IIUC, for Jame, the first time it appears and has the matches is on index 0, whereby Mike is on index 2. So we use these two names as the keys –  Aug 22 '22 at 19:29
  • Yes, but I want the code to define the keys not myself – David 54321 Aug 22 '22 at 20:01
  • 1
    hey @David54321 You can find the key using this code ```key = set() for row in df.to_numpy(): if not key: key.add(row[0]) found = False for k in key: if k in row: found = True break if not found : key.add(row[0])``` – Noman Aug 22 '22 at 20:52
  • See if it works for you now? –  Aug 23 '22 at 00:29