I have a DataFrame with IDs, names, and addresses. I would like to cluster the addresses via affinity propagation (or another algorithm) in order to fuzzy match/group on the address strings. This is the part I have so far:
import pandas as pd
import pyodbc
import numpy as np
from sklearn.cluster import AffinityPropagation
from pyjarowinkler import distance
from sklearn import metrics
conn = pyodbc.connect(r'DSN=<UserDSN>;')
df = pd.read_sql('select * from <InputTable>', conn)
addr = df['Addresses']
addr = np.asarray(addr)
jw = np.array([[distance.get_jaro_distance(w1,w2) for w1 in addr] for w2 in addr])
affprop = AffinityPropagation(affinity="precomputed", damping=.5)
affprop.fit(jw)
for cluster_id in np.unique(affprop.labels_):
    # Address string chosen as the representative of this cluster
    exemplar = addr[affprop.cluster_centers_indices_[cluster_id]]
    # Distinct address strings assigned to this cluster
    cluster = np.unique(addr[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Now, how do I make this clustering useful by adding a "Cluster" column to the DataFrame? Essentially, I want to write the exemplar for each cluster back to the corresponding rows in the DataFrame. Do I need some kind of unique ID to be able to do that? The purpose of this is to identify duplicate rows in the data, so there is no unique ID currently. However, perhaps I can add one to the original DataFrame somehow, since each row as a whole will be unique?
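One possible approach, as a minimal sketch rather than a definitive answer: after fit(), affprop.labels_ holds one cluster index per input row, in the same order as addr, so the exemplars can be assigned back by position without a unique ID. The helper name add_cluster_columns, the "ClusterID" column, and the small demo frame with hard-coded labels are hypothetical stand-ins for the real df and the fitted model, and the sketch assumes the fit converged.

```python
import numpy as np
import pandas as pd


def add_cluster_columns(df, labels, center_indices, source_col="Addresses"):
    """Return a copy of df with a cluster index and an exemplar column.

    labels[i] is the cluster of the i-th row (by position) and
    center_indices[k] is the row position of cluster k's exemplar,
    which is how AffinityPropagation exposes labels_ and
    cluster_centers_indices_ after fit().
    """
    labels = np.asarray(labels)
    center_indices = np.asarray(center_indices)
    values = df[source_col].to_numpy()
    out = df.copy()
    out["ClusterID"] = labels
    # Row i -> its cluster -> that cluster's exemplar row -> exemplar string
    out["Cluster"] = values[center_indices[labels]]
    return out


# Hypothetical stand-ins for df, affprop.labels_ and
# affprop.cluster_centers_indices_ (not real clustering output).
demo = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob"],
    "Addresses": ["1 Main St", "1 Main Street", "22 Oak Ave", "22 Oak Avenue"],
})
demo_labels = np.array([0, 0, 1, 1])
demo_centers = np.array([0, 2])

result = add_cluster_columns(demo, demo_labels, demo_centers)
print(result)
```

With the real data the call would presumably be add_cluster_columns(df, affprop.labels_, affprop.cluster_centers_indices_). If an explicit row ID is still wanted, df.reset_index() turns the existing positional index into a column.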
Thank you all for any insight you may have!