
My data is shown below. As you can see, the first entry is 'tim', which matches tim.rand and timrook. Similarly, pankit090 matches pankit001, pankit002, pankit003, pankit004 and pankit005.

My data

I want the result to look like this:

Result after grouping of names

What I have been able to achieve so far is

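from fuzzywuzzy import process

# `database` holds the data shown above
emailsdb = database['Names'].values.tolist()
matches = []  # alternating entries: name, [(match, score), ...]
for email in emailsdb:
    # compare each name against every other name, not against itself
    newlookup = emailsdb.copy()
    newlookup.remove(email)
    result = process.extractBests(email, newlookup, score_cutoff=85, limit=50)
    if result:
        matches.append(email)
        matches.append(result)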

What I get is

['tim',
 [('tim.rand', 90), ('timrook', 90)],
 'tim.rand',
 [('tim', 90)],
 'pankit090',
 [('pankit001', 89),
  ('pankit002', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
 'timrook',
 [('tim', 90)],
 'pankit001',
 [('pankit090', 89),
  ('pankit002', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
 'pankit002',
 [('pankit090', 89),
  ('pankit001', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
...........
...........

What I need is a suggestion for getting from this output to the final result shown in the picture above, with 2 line items: one for each group of user names that fuzzywuzzy was able to match.
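
One direction that might work, as a rough sketch rather than a verified solution: treat every match as a link between two names and merge linked names into one group (a union-find). The helper name `group_matches` is made up for illustration, and `sample` is a trimmed copy of the output above, in the same alternating name / matches shape.

```python
def group_matches(flat):
    """Merge names that are linked by a fuzzy match into groups.

    `flat` alternates: name, [(match, score), ...], name, [...], ...
    """
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we go
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Pair each name (even positions) with its match list (odd positions).
    for name, matches in zip(flat[0::2], flat[1::2]):
        for other, _score in matches:
            union(name, other)

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return [sorted(members) for members in groups.values()]


sample = [
    'tim', [('tim.rand', 90), ('timrook', 90)],
    'tim.rand', [('tim', 90)],
    'pankit090', [('pankit001', 89), ('pankit002', 89)],
    'timrook', [('tim', 90)],
]
groups = group_matches(sample)
print(groups)
```

With this approach a name ends up in a group even if it only matches one other member, so the mirrored entries ('tim.rand' matching 'tim' and vice versa) collapse into a single line item.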

I also need the count of distinct TID and distinct PID values in each group.
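
For the counts, a minimal sketch under some assumptions: each row carries `Names`, `TID` and `PID` fields, and a name-to-group mapping is already available from the grouping step. The function name `distinct_counts`, the mapping `name_to_group`, and all TID/PID values below are placeholders invented for illustration, not my real data. It uses plain Python sets; on a DataFrame, a `groupby` on the group label followed by `nunique()` should give the same numbers.

```python
from collections import defaultdict


def distinct_counts(rows, name_to_group):
    """Return {group: (distinct TID count, distinct PID count)}."""
    tids = defaultdict(set)
    pids = defaultdict(set)
    for row in rows:
        group = name_to_group.get(row['Names'])
        if group is None:
            continue  # this name was not matched to any group
        tids[group].add(row['TID'])
        pids[group].add(row['PID'])
    return {g: (len(tids[g]), len(pids[g])) for g in tids}


# Placeholder rows: the TID and PID values are made up.
rows = [
    {'Names': 'tim', 'TID': 1, 'PID': 10},
    {'Names': 'tim.rand', 'TID': 1, 'PID': 11},
    {'Names': 'timrook', 'TID': 2, 'PID': 10},
    {'Names': 'pankit090', 'TID': 3, 'PID': 12},
    {'Names': 'pankit001', 'TID': 3, 'PID': 12},
]
# Each name mapped to a label for its group (here, the first name seen).
name_to_group = {
    'tim': 'tim', 'tim.rand': 'tim', 'timrook': 'tim',
    'pankit090': 'pankit090', 'pankit001': 'pankit090',
}
counts = distinct_counts(rows, name_to_group)
print(counts)
```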

Gupta