In my current approach, I have:
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
s = pd.DataFrame([[0.8, 0. , 3. ],
                  [1. , 1. , 2. ],
                  [0.3, 3. , 4. ]], columns=['dist', 'v1', 'v2'])
N = 5  # number of elements; the largest index in v1/v2 is 4
sparseD = csr_matrix((1 - s['dist'], (s['v1'].astype(int), s['v2'].astype(int))), shape=(N, N))
agg = AgglomerativeClustering(n_clusters=None, affinity='precomputed', linkage='complete', distance_threshold=.25)
agg.fit_predict(sparseD)
The last line raises
TypeError: cannot take a sparse matrix.
If I cast the data to a dense array with toarray(), the code works and produces the expected output, but it uses a lot of memory and is slow at the real data size of 61K x 61K.
I am wondering whether there is another library (or scikit-learn API) that can do the same linkage clustering on a precomputed, sparse distance matrix: if there is no entry for a given (element1, element2) pair, the two would simply never be linked, and everything else would behave the same.
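For what it's worth, if single linkage were an acceptable substitute for complete linkage, the threshold cut can be computed directly on the sparse data: cutting a single-linkage dendrogram at a threshold gives exactly the connected components of the graph that keeps only the edges below that threshold, and pairs with no entry are never linked, as desired. This equivalence does not hold for complete linkage. A sketch with a made-up edge list:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical sparse edge list: (i, j, distance) triples; absent
# pairs can never end up in the same cluster.
rows = np.array([0, 1, 3])
cols = np.array([3, 2, 4])
dist = np.array([0.2, 0.0, 0.7])
N = 5

threshold = 0.25
keep = dist < threshold                      # drop edges at/above the cut
G = csr_matrix((np.ones(keep.sum()), (rows[keep], cols[keep])),
               shape=(N, N))

# Single-linkage clusters at this threshold = connected components
# of the thresholded graph; this never builds a dense N x N matrix.
n_clusters, labels = connected_components(G, directed=False)
```

This runs in roughly O(edges) time and memory, so 61K elements with a sparse edge list is cheap; whether it is usable depends on whether the single-linkage semantics fit the problem.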