
I have a DataFrame with IDs, names, and addresses. I would like to cluster the addresses via affinity propagation or another algorithm in order to fuzzy match/group on the address strings. This part I have:

import pandas as pd
import pyodbc
import numpy as np
from sklearn.cluster import AffinityPropagation
from pyjarowinkler import distance
from sklearn import metrics

conn = pyodbc.connect(r'DSN=<UserDSN>;')
df = pd.read_sql('select * from <InputTable>', conn)

addr = df['Addresses']
addr = np.asarray(addr)

jw = np.array([[distance.get_jaro_distance(w1,w2) for w1 in addr] for w2 in addr])

affprop = AffinityPropagation(affinity="precomputed", damping=.5)
affprop.fit(jw)

for cluster_id in np.unique(affprop.labels_):
    exemplar = addr[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(addr[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Now, how do I make this clustering useful by having a "Cluster" column in the DataFrame? Essentially, I want to add the exemplar for each cluster back to the corresponding rows in the DataFrame. Do I need some kind of unique ID to be able to do that? The purpose of this is to identify duplicate rows in the data, so there is no Unique ID currently. However, perhaps I can add one to the original DataFrame somehow since each row as a whole will be unique?

Thank you all for any insight you may have!

OverflowingTheGlass

1 Answer

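The labels in affprop.labels_ come back in the same order as the rows passed to fit, so no unique ID is needed; they can be assigned directly as a new column:

df['new_col'] = list(affprop.labels_)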
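To get the exemplar address itself onto each row (which is what the question asks for), the labels can be used to index into the exemplar positions. The following is a minimal sketch rather than a definitive implementation: the helper name add_cluster_columns and the output column names "Cluster" and "Exemplar" are made up for illustration, and the small arrays stand in for affprop.labels_ and affprop.cluster_centers_indices_ from a fitted model. It assumes the fit converged and produced at least one exemplar.

```python
import numpy as np
import pandas as pd


def add_cluster_columns(df, labels, center_indices, column="Addresses"):
    """Return a copy of df with each row's cluster label and exemplar address.

    labels         -- one cluster id per row (e.g. affprop.labels_)
    center_indices -- row position of each cluster's exemplar
                      (e.g. affprop.cluster_centers_indices_)
    """
    labels = np.asarray(labels)
    center_indices = np.asarray(center_indices)
    values = df[column].to_numpy()

    out = df.copy()
    out["Cluster"] = labels
    # center_indices[labels] gives, for every row, the position of its
    # cluster's exemplar; indexing the address array with it yields the text.
    out["Exemplar"] = values[center_indices[labels]]
    return out


# Hand-made stand-ins for a fitted model's attributes (illustrative only).
demo = pd.DataFrame(
    {"Addresses": ["1 Main St", "1 Main Street", "9 Oak Ave", "9 Oak Avenue"]}
)
demo_labels = [0, 0, 1, 1]   # plays the role of affprop.labels_
demo_centers = [0, 3]        # plays the role of affprop.cluster_centers_indices_

result = add_cluster_columns(demo, demo_labels, demo_centers)
print(result)
```

With the real model this would be called as add_cluster_columns(df, affprop.labels_, affprop.cluster_centers_indices_). Rows that share an Exemplar value are then candidate duplicates, and df.groupby can be applied on that column.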
StupidWolf
mluk
  • Please remember to include an explanation when answering questions, instead of simply providing code. That will ensure this answer is useful for people not familiar with this syntax—which, we might assume, will be the same people with this type of question. Can you edit your answer to offer some detail on what this does? – Jeremy Caney Jun 19 '21 at 19:05