
Here is my dataframe:

import pandas as pd
from fuzzywuzzy import fuzz, process

df = pd.DataFrame(
    dict(Name=['Emma Howard', 'Emma Ward', 'Emma Warner', 'Emma Wayden'],
         Age=[33, 34, 43, 44], Score=[90, 95, 93, 92])
)

list2 = df['Name'].tolist()

I am applying fuzzywuzzy process:

process.extractBests(i, list2, score_cutoff=80, scorer=fuzz.ratio)

to extract the best matches on the Name column, and it gives the result below: [image: current output]

The output I'm expecting is: [image: expected output]

The logic is that "Emma Howard" and "Emma Ward" are already matched in the first row, so I do not want "Emma Howard" to appear among the matches for the second row, and likewise for the third and fourth rows.

Here is the complete pseudo code:

mat1 = []
names = df['Name'].tolist()

for i in names:
    # Compare each name against every other name (everything except itself).
    others = [x for x in names if x != i]
    mat1.append(process.extractBests(i, others, score_cutoff=80, scorer=fuzz.ratio))
df['matches'] = mat1
df.to_csv("xyz.csv")
Kingston X
  • I don't know how many rows you have, but you _could_ consider matching across all rows simultaneously. In other words: are your results stable if you change the order of your rows? Would you want them to be? – jtlz2 May 23 '23 at 08:10
  • Also, do you care about the order of the matches within a row? – jtlz2 May 23 '23 at 08:13
  • The number of rows is about 35k and the order doesn't matter. – Kingston X May 23 '23 at 08:21

1 Answer


IIUC, once a name has been used, it is no longer available for subsequent rows, so you can use set operations to remove names that have already been assigned:

uniques = set(df['Name'])  # names still available as match candidates
matches = {}
for idx, row in df.iterrows():
    uniques.discard(row['Name'])  # never match a name against itself
    res = process.extractBests(row['Name'], uniques, score_cutoff=80)
    uniques -= {name for name, score in res}  # matched names are no longer available
    matches[idx] = res
df['matches'] = pd.Series(matches)

Note: each iteration is faster than the previous one because fewer names are left in the set.

Output:

>>> df
          Name  Age  Score                                 matches
0  Emma Howard   33     90                       [(Emma Ward, 90)]
1    Emma Ward   34     95  [(Emma Wayden, 80), (Emma Warner, 80)]
2  Emma Warner   43     93                                      []
3  Emma Wayden   44     92                                      []
Corralien
  • I updated my answer, can you check it please? – Corralien May 23 '23 at 08:03
  • Thanks a lot for your answer, but I'm getting a different output than expected. This is what I'm getting (columns: Name, Age, Score, matches): Emma Howard 33 90 0; Emma Ward 34 95 1; Emma Warner 43 93 2; Emma Wayden 44 92 3 – Kingston X May 23 '23 at 08:10
  • What version of Pandas do you have? I updated my answer. It should be right now. – Corralien May 23 '23 at 09:14
  • Thanks a lot. I am using 3.7 and understood why it wasn't working earlier. In my case I do not want to remove the best matches as they could match with another name. I modified that part and got the desired result :) – Kingston X May 23 '23 at 12:04
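
The variant described in the last comment, where only the current name is dropped and earlier matches stay available to later rows, could look roughly like the sketch below. It assumes the names are unique, and it uses the standard library's difflib as a stand-in scorer so that it runs without fuzzywuzzy. The helpers ratio and forward_matches are hypothetical names for illustration, not part of any library. In the real script, process.extractBests(name, later, score_cutoff=80, scorer=fuzz.ratio) would presumably take the place of the scoring and filtering lines.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (assumption): case-insensitive similarity, 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def forward_matches(names, score_cutoff=80):
    """For each name, keep (candidate, score) pairs among later names only.

    Assumes names are unique, so "later names" is the same as
    "every name that has not yet been the current row".
    """
    results = []
    for pos, name in enumerate(names):
        later = names[pos + 1:]  # earlier rows already had their turn
        scored = [(other, ratio(name, other)) for other in later]
        kept = [pair for pair in scored if pair[1] >= score_cutoff]
        # Best score first; the sort is stable, so ties keep list order.
        kept.sort(key=lambda pair: pair[1], reverse=True)
        results.append(kept)
    return results


names = ['Emma Howard', 'Emma Ward', 'Emma Warner', 'Emma Wayden']
matches = forward_matches(names)
for name, found in zip(names, matches):
    print(name, found)
```

For the four sample names this yields the same matches as the output in the answer (tied scores may be listed in a different order). With n unique names the loop scores n(n-1)/2 pairs, which is roughly 612 million for 35k rows.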