Remove duplicate approximate word matching using fuzzy python

Question

I would like to ask how to remove approximately matching duplicate names using fuzzy matching in Python, or ANY METHOD that is feasible. I have an Excel file that contains approximately similar names, and I would like to remove the names that are highly similar to each other so that only one name remains.

For instance, here is the input (Excel file); there are 6 rows and 5 columns in total:

|-------------------|-----|-----|-----|-----|-----|  
| abby_john         | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|  
| abby_johnny       | def | def | def | def | def |  
|-------------------|-----|-----|-----|-----|-----|  
| a_j               | ghi | ghi | ghi | ghi | ghi |  
|-------------------|-----|-----|-----|-----|-----|  
| abby_(john)       | abc | abc | abc | abc | abc |  
|-------------------|-----|-----|-----|-----|-----|  
| john_abby_doe     | def | def | def | def | def | 
|-------------------|-----|-----|-----|-----|-----|  
| aby_/_John_Doedy  | ghi | ghi | ghi | ghi | ghi |  
|-------------------|-----|-----|-----|-----|-----|

Although all the names above look different, they are actually the same. How can Python recognize that they are all the same, remove the duplicated names, and keep ANY ONE of the names along with its entire row? By the way, the input file is in Excel format (.xlsx).

Desired output:

|-------------------|-----|-----|-----|-----|-----|  
| abby_john         | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|

Since the underscore is not very important, it can be replaced with a space, so the following output is also acceptable.

Another desired output:

|-------------------|-----|-----|-----|-----|-----|  
| abby john         | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|

I would appreciate it a lot if anyone can help me out, thanks!

What if a_j is forced to be similar to the rest, is it still possible to solve? – Edison Toh May 17 '20 at 21:35
I'll post an answer. But it doesn't treat `a_j` as similar. – mechanical_meat May 17 '20 at 21:41
Great!! Thanks a lot, bro! By the way, I've edited the input data; would you mind using the revised input data? Thanks in advance. – Edison Toh May 17 '20 at 21:47
I used the revised input data with a couple of added entries, too. The other answer is good, and less code! – mechanical_meat May 17 '20 at 22:56
Thanks a lot, bro! I will try your code later, really appreciate it! – Edison Toh May 18 '20 at 07:46

mechanical_meat · Answer 1 · 2020-05-18T10:41:04.163

This is a class of problem called semantic similarity.

Get the data:

from io import StringIO
s = StringIO("""abby_john         abc   abc   abc   abc 
abby_johnny       def   def   def   def 
a_j               ghi   ghi   ghi   ghi 
abby_(john)       abc   abc   abc   abc 
abby_john_doe     def   def   def   def 
aby_John_Doedy    ghi   ghi   ghi   ghi
abby john         ghi   ghi   ghi   ghi
john_abby_doe     def   def   def   def
aby_/_John_Doedy  ghi   ghi   ghi   ghi
doe jane          abc   abc   abc   abc
doe_jane          def   def   def   def""")

import pandas as pd
df = pd.read_fwf(s, header=None, sep=r'\s+')  # raw string avoids an invalid-escape warning
lst_original = df[0].tolist()  # the first column

Vectorize (turn into numerical representation):

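import numpy as np
from gensim.models import Word2Vec

# Each name is a plain string, so Word2Vec treats every character as a "word".
# gensim >= 4 API: `vector_size` replaces `size`, and vectors live under `m.wv`.
m = Word2Vec(lst_original, vector_size=50, min_count=1, cbow_mean=1)

def vectorizer(sent, m):
    """Return the mean of the character vectors of `sent`."""
    vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                vec = m.wv[w]
            else:
                vec = np.add(vec, m.wv[w])
            numw += 1
        except KeyError as e:  # character missing from the vocabulary
            print(e)
    return np.asarray(vec) / numw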

l = []
for i in lst_original:
    l.append(vectorizer(i,m))

X = np.array(l)

KMeans clustering:

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2,init='k-means++',n_init=100,random_state=0)
labels = clf.fit_predict(X)

Then we print only the entries where the cluster label differs from that of the previous entry:

previous_cluster = 0
for index, sentence in enumerate(lst_original):
    if index > 0:
        previous_cluster = labels[index - 1]
    cluster = labels[index]
    if previous_cluster != cluster:
        print(str(labels[index]) + ":" + str(sentence))

Result: as you can see, a_j is treated differently from the rest of the abby_john group:

1:a_j
0:abby_(john)
1:doe jane
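To make the "alternating cluster" idea easier to follow, here is a minimal sketch of the same selection as a standalone function. The name `first_of_each_run` is made up for illustration, and the `labels` list is an assumption: it is reconstructed from the printed result above, not re-computed with KMeans, so a fresh run may assign the labels differently.

```python
def first_of_each_run(names, labels):
    """Return (label, name) for the first item of every run of equal labels."""
    picked = []
    previous = None  # there is no previous label before the first item
    for name, label in zip(names, labels):
        if label != previous:
            picked.append((label, name))
        previous = label
    return picked


# Names from the sample data above, in the same order.
names = [
    "abby_john", "abby_johnny", "a_j", "abby_(john)", "abby_john_doe",
    "aby_John_Doedy", "abby john", "john_abby_doe", "aby_/_John_Doedy",
    "doe jane", "doe_jane",
]
# Assumption: labels reconstructed from the printed result, not re-computed.
labels = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]

for label, name in first_of_each_run(names, labels):
    print(f"{label}:{name}")
```

Unlike the loop above, this variant also reports the very first run (`0:abby_john`), because it starts with no previous label instead of assuming cluster 0. Note that with only two clusters, a run of 0 after `a_j` is the same cluster as the run of 0 before it.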

edited May 18 '20 at 10:41

answered May 17 '20 at 22:49

mechanical_meat


OK, thanks bro! I will try your code later, really appreciate it! – Edison Toh May 18 '20 at 07:55
Hi, mechanical_meat, I tried your code and it returned the error ```NameError: name 'lst_original' is not defined```. May I know why this error occurs? – Edison Toh May 18 '20 at 10:38
Ok, tell me how it works out when you get a chance! :) – mechanical_meat May 18 '20 at 10:48
Ya, it worked perfectly! May I know the meaning of ```1:a_j 0:abby_(john) 1:doe jane```? What do '1:' and '0:' represent? – Edison Toh May 18 '20 at 10:57
Great to hear! The 0 and 1 are the clusters of similarity. Then we take only the first of each alternating cluster for the result. I hope that makes sense. – mechanical_meat May 18 '20 at 11:00
Does that mean 1 represents dissimilar while 0 represents similar? – Edison Toh May 18 '20 at 11:04
Not exactly. It means that the similar ones are grouped into either 0 or 1. So in the data when there is a dissimilar element the number changes from 0 to 1 or 1 to 0. – mechanical_meat May 18 '20 at 11:07
Does that mean 'a_j' and 'doe jane' are in one group since they both show '1', while 'abby_(john)' is in another group since it shows '0'? – Edison Toh May 18 '20 at 11:27
No, the groups alternate. So each of those is in a separate group. – mechanical_meat May 18 '20 at 11:30
Hi, @mechanical_meat, can I ask a favor from you? – Edison Toh May 19 '20 at 09:15
@EdisonToh: what's the favor? – mechanical_meat May 19 '20 at 16:52
I actually needed your help on another post of mine regarding a fuzzy matching problem, https://stackoverflow.com/questions/61874002/copy-approximate-string-matching-from-excel-to-another-excel-file-using-python. However, the problem has been solved, thanks anyway!! – Edison Toh May 19 '20 at 20:00
Oh, by the same answer too! Nice. – mechanical_meat May 19 '20 at 20:09
Ya, thanks for your response and for being ready to help out! – Edison Toh May 19 '20 at 20:44
m[w] doesn't work in gensim 4 now; use m.wv[w] instead. – Shahid Chaudhary Feb 18 '22 at 06:20

Marco Cerliani · Accepted Answer · 2020-05-19T14:38:00.297

I use this function to correct and replace names, and then I remove the duplicate matches, keeping only the first match:

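import difflib
import re

import pandas as pd

def similarity_replace(series):
    # Map each original name to its normalized form (lowercase letters only),
    # and each normalized form back to an original name.
    reverse_map = {}
    diz_map = {}
    for _, s in series.items():  # Series.iteritems() was removed in pandas 2.0
        key = re.sub(r'[^a-z]', '', s.lower())
        diz_map[s] = key
        reverse_map[key] = s

    # For each normalized name, pick the shortest of its closest matches.
    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.5), key=len)[0]

    return series.map(diz_map).map(best_match).map(reverse_map)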

df = pd.DataFrame({'name': ['abby_john', 'abby_johnny', 'a_j', 'abby_(john)', 'john_abby_doe', 'aby_/_John_Doedy'],
                   'col1': ['abc', 'add', 'sda', 'sas', 'sad', 'ass'],
                   'col2': ['abc', 'add', 'sda', 'sas', 'sad', 'ass'],
                   'col3': ['abc', 'add', 'sda', 'sas', 'sad', 'ass']})

df['name'] = similarity_replace(df.name)
df

df.drop_duplicates(['name'])

a_j does not seem possible to remove.
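To see why `a_j` slips through, it can help to look at the similarity scores directly. The following is a minimal sketch, not part of the function above: the helpers `normalize` and `similarity` are names made up for illustration. It applies the same normalization (lowercase, letters only) and computes the `difflib.SequenceMatcher` ratio, which is the score that `get_close_matches` compares against its `cutoff`.

```python
import difflib
import re


def normalize(name):
    """Lowercase the name and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", name.lower())


def similarity(a, b):
    """difflib similarity ratio (0.0 to 1.0) of two normalized names."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


names = ["abby_john", "abby_johnny", "abby_(john)", "john_abby_doe", "aby_/_John_Doedy"]
for other in names:
    print(f"a_j vs {other}: {similarity('a_j', other):.2f}")
```

After normalization `a_j` is just `aj`, and its ratio against every other name stays below the `cutoff=0.5` used above (the highest is 0.4, against `abbyjohn`), so it only ever matches itself. Lowering the cutoff far enough to catch it would probably also merge unrelated names, so initials like this likely need a separate rule.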

Thanks a lot, bro! I will definitely try it and let you know if it works, really appreciate it! – Edison Toh May 18 '20 at 07:46
