
In my current approach, I have:

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering
import pandas as pd

s = pd.DataFrame([[0.8, 0., 3.],
                  [1., 1., 2.],
                  [0.3, 3., 4.]], columns=['dist', 'v1', 'v2'])
N = int(max(s['v1'].max(), s['v2'].max())) + 1  # number of elements
# Indices must be integers for the sparse constructor
sparseD = csr_matrix((1 - s['dist'], (s['v1'].astype(int), s['v2'].astype(int))), shape=(N, N))
agg = AgglomerativeClustering(n_clusters=None, affinity='precomputed', linkage='complete', distance_threshold=.25)
agg.fit_predict(sparseD)
```

The last line raises

```
TypeError: cannot take a sparse matrix.
```

If I convert the input with `.toarray()`, the code works and produces the expected output, but it uses a lot of memory and is slow: the real data is about 61K x 61K.
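For scale, a quick back-of-the-envelope check shows why the dense route hurts: the dense float64 distance matrix alone needs roughly 30 GB before the linkage computation even starts.

```python
n = 61_000
# A dense float64 distance matrix stores n * n entries of 8 bytes each.
dense_bytes = n * n * 8
print(dense_bytes / 1e9)  # ~29.8 GB
```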

I am wondering whether another library (or scikit-learn API) can do the same linkage clustering on a precomputed, sparse distance matrix: if there is no entry for a given (element1, element2) pair, the API would never link them, and everything else would work the same.
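(For context, the single-linkage special case of this can already be done sparsely: cutting a single-linkage dendrogram at `distance_threshold` is equivalent to taking connected components of the graph that keeps only edges with distance below the threshold, and `scipy.sparse.csgraph` stays sparse throughout. A minimal sketch with made-up triplet data; my question is about the general case, e.g. complete linkage:)

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Made-up (dist, v1, v2) triplets in the same layout as above.
dists = np.array([0.1, 0.2, 0.9])
rows = np.array([0, 1, 3])
cols = np.array([1, 2, 4])
N, threshold = 5, 0.25

# Keep only pairs closer than the threshold; missing pairs never link.
keep = dists < threshold
graph = csr_matrix((np.ones(keep.sum()), (rows[keep], cols[keep])), shape=(N, N))

# Connected components == single-linkage clusters cut at `threshold`.
n_clusters, labels = connected_components(graph, directed=False)
print(n_clusters)  # 3 clusters: {0, 1, 2}, {3}, {4}
```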

desertnaut
Sam Shleifer

  • Who or what raises the `TypeError`? You won't get much help if you are stingy with the debugging information. – hpaulj Jun 27 '19 at 19:56
  • Sorry! The last line raises `TypeError` – Sam Shleifer Jun 27 '19 at 22:00
  • There are sklearn operations that do work with sparse matrices. More than most packages. Apparently this isn't one of those. Function docs should be clear on the matter. – hpaulj Jun 27 '19 at 22:15
  • 1
    Totally! My question is whether there is a similar function that works on sparse matrices. – Sam Shleifer Jun 28 '19 at 23:22
  • @Sam Shleifer, did you ever find the answer? I am after the same issue, but no solution so far... Thank you. – Memin Dec 12 '21 at 01:19

0 Answers