2

I would like to ask on how to remove duplicate approximate word matching using fuzzy in python or ANY METHOD that is feasible. I have an excel that contains approximate similar name, at this point, I would like to remove the name that contains high similarity and remain only one name.

For instance, here is the input (excel file), there is 6 rows and 5 columns in total:

|-------------------|-----|-----|-----|-----|-----|  
| abby_john         | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|  
| abby_johnny       | def | def | def | def | def |  
|-------------------|-----|-----|-----|-----|-----|  
| a_j               | ghi | ghi | ghi | ghi | ghi |  
|-------------------|-----|-----|-----|-----|-----|  
| abby_(john)       | abc | abc | abc | abc | abc |  
|-------------------|-----|-----|-----|-----|-----|  
| john_abby_doe     | def | def | def | def | def | 
|-------------------|-----|-----|-----|-----|-----|  
| aby_/_John_Doedy  | ghi | ghi | ghi | ghi | ghi |  
|-------------------|-----|-----|-----|-----|-----|  

Although all the above of name looks different, they actually is the same, how should python know they all are the same and remove duplicated name and remains ANY ONE of name and remains it's entire row? By the way, the input file is in Excel file format (.xlsx).

Desired output:

|-------------------|-----|-----|-----|-----|-----|  
| abby_john         | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|  

Since the underscore is not very important, it can be replaced with 'spacing', thus another output as following is acceptable: Another desired output:

|-------------------|-----|-----|-----|-----|-----|  
| abby_john         | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|  

Appreciate a lot if anyone can help me out, thanks!

Edison Toh
  • 87
  • 1
  • 11

2 Answers2

2

This is a class of problem called semantic similarity.

Get the data:

from io import StringIO
s = StringIO("""abby_john         abc   abc   abc   abc 
abby_johnny       def   def   def   def 
a_j               ghi   ghi   ghi   ghi 
abby_(john)       abc   abc   abc   abc 
abby_john_doe     def   def   def   def 
aby_John_Doedy    ghi   ghi   ghi   ghi
abby john         ghi   ghi   ghi   ghi
john_abby_doe     def   def   def   def
aby_/_John_Doedy  ghi   ghi   ghi   ghi
doe jane          abc   abc   abc   abc
doe_jane          def   def   def   def""")

import pandas as pd
df = pd.read_fwf(s,header=None,sep='\s+')
lst_original = df[0].tolist() # the first column 

Vectorize (turn into numerical representation):

import numpy as np 
from gensim.models import Word2Vec

m = Word2Vec(lst_original,size=50,min_count=1,cbow_mean=1)  
def vectorizer(sent,m): 
    vec = [] 
    numw = 0 
    for w in sent: 
        try: 
            if numw == 0: 
                vec = m[w] 
            else: 
                vec = np.add(vec, m[w]) 
            numw += 1 
        except Exception as e: 
            print(e) 
    return np.asarray(vec) / numw 

l = []
for i in lst_original:
    l.append(vectorizer(i,m))

X = np.array(l)

KMeans clustering:

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2,init='k-means++',n_init=100,random_state=0)
labels = clf.fit_predict(X)

Then we get just the values where the cluster alternates:

previous_cluster = 0
for index, sentence in enumerate(lst_original):
    if index > 0:
        previous_cluster = labels[index - 1]
    cluster = labels[index]
    if previous_cluster != cluster:
        print(str(labels[index]) + ":" + str(sentence))

Result, and as you can see a_j is treated differently to the rest of the abby_john group:

1:a_j
0:abby_(john)
1:doe jane
mechanical_meat
  • 163,903
  • 24
  • 228
  • 223
1

I use this function to correct and replace names and then I remove the duplicate matches, maintaining only the first match

import difflib
import re

def similarity_replace(series):

    reverse_map = {}
    diz_map = {}
    for i,s in series.iteritems():
        diz_map[s] = re.sub(r'[^a-z]', '', s.lower())
        reverse_map[re.sub(r'[^a-z]', '', s.lower())] = s

    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.5), key=len)[0]

    return series.map(diz_map).map(best_match).map(reverse_map)

df = pd.DataFrame({'name':['abby_john','abby_johnny','a_j','abby_(john)','john_abby_doe','aby_/_John_Doedy'],
                       'col1':['abc','add','sda','sas','sad','ass'],
                       'col2':['abc','add','sda','sas','sad','ass'],
                       'col3':['abc','add','sda','sas','sad','ass']})

df['name'] = similarity_replace(df.name)
df

enter image description here

df.drop_duplicates(['name'])

enter image description here

a_j seems not possible to remove

Marco Cerliani
  • 21,233
  • 3
  • 49
  • 54