I am attempting to perform constrained clustering using sklearn's agglomerative clustering command. To make the algorithm constrained, it requests a "connectivity matrix". This is described as:
The connectivity constraints are imposed via an connectivity matrix: a scipy sparse matrix that has elements only at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages with a link pointing from one to another.
I have a list of observation pairs that I wish the algorithm will force to remain in the same cluster. I can convert this into a sparse scipy
matrix (either coo
or csr
), but the resulting clusters fail to force the constraints.
Some Data:
import numpy as np
import scipy as sp
import pandas as pd
import scipy.sparse as ss
from sklearn.cluster import AgglomerativeClustering
# unique ids
ids = np.arange(10)
# Pairs that should belong to the same cluster
mustLink = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
# Features for training the model
data = pd.DataFrame([
[.0873,-1.619,-1.343],
[0.697456, 0.410943, 0.804333],
[-1.295829, -0.709441, -0.376771],
[-0.404985, -0.107366, 0.875791],
[-0.404985, -0.107366, 0.875791],
[-0.515996, 0.731980, -1.569586],
[1.024580, 0.409148, 0.149408],
[-0.074604, 1.269414, 0.115744],
[-0.006706, 2.097276, 0.681819],
[-0.432196, 1.249149,-1.159271]])
Convert the pairs to a "connectivity matrix":
# Blank coo matrix to csr
sm = ss.coo_matrix((len(ids), len(ids)), np.int32).tocsr()
# Insert 1 for connected pairs and diagonals
for i in np.arange(len(mustLink)): # add links to both sides of the matrix
sm[mustLink.loc[i, 'A'], mustLink.loc[i, 'B']] = 1
sm[mustLink.loc[i, 'B'], mustLink.loc[i, 'A']] = 1
for i in np.arange(sm.tocsr()[1].shape[1]): # add diagonals
sm[i,i] = 1
sm = sm.tocoo() # convert back to coo format
Train and fit the agglomerative clustering model:
m = AgglomerativeClustering(n_clusters=6, connectivity=sm)
out = m.fit_predict(X=data)
Warning I receive:
UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early. connectivity, n_components = _fix_connectivity(X, connectivity)
In addition to the ominous warning, the pairs that I hoped would belong to the same cluster, do not.
Is this because the sklearn algorithms are not designed to handle mustlink constraints and instead can only use a distance
matrix (distinction drawn here)?