I'm trying to use scikit-learn's DBSCAN implementation to cluster a collection of documents. First I build the TF-IDF matrix with scikit-learn's TfidfVectorizer (a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster specific subsets of this matrix. Things work fine when the subset is small (say, up to a few thousand rows), but with large subsets (tens of thousands of rows) I get `ValueError: could not convert integer scalar`.
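For reference, a minimal sketch of the setup (the `documents` variable and the default vectorizer settings are assumptions; only `tfidf` and `idxs` appear in my actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# documents: assumed list of ~163k raw text strings (not shown here)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # csr_matrix, shape (163405, 13029)

# idxs: list of row indices selecting the subset to cluster (~107k rows)
ncm_clusterizer = DBSCAN()
labels = ncm_clusterizer.fit_predict(tfidf[idxs])  # raises the ValueError for large subsets
```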

Here's the full traceback (`idxs` is a list of indices):


ValueError                        Traceback (most recent call last)
<ipython-input-1-73ee366d8de5> in <module>()
    193     # use descriptions to clusterize items
    194     ncm_clusterizer = DBSCAN()
--> 195     ncm_clusterizer.fit_predict(tfidf[idxs])
    196     idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_))
    197     for e in idxs_clusters:

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight)
    294             cluster labels
    295         """
--> 296         self.fit(X, sample_weight=sample_weight)
    297         return self.labels_

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight)
    264         X = check_array(X, accept_sparse='csr')
    265         clust = dbscan(X, sample_weight=sample_weight,
--> 266                        **self.get_params())
    267         self.core_sample_indices_, self.labels_ = clust
    268         if len(self.core_sample_indices_):

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, sample_weight, n_jobs)
    136         # This has worst case O(n^2) memory complexity
    137         neighborhoods = neighbors_model.radius_neighbors(X, eps,
--> 138                                                          return_distance=False)
    139 
    140     if sample_weight is None:

/usr/local/lib/python3.4/site-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance)
    584             if self.effective_metric_ == 'euclidean':
    585                 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 586                                           n_jobs=self.n_jobs, squared=True)
    587                 radius *= radius
    588             else:

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1238         func = partial(distance.cdist, metric=metric, **kwds)
   1239 
-> 1240     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1241 
   1242 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1081     if n_jobs == 1:
   1082         # Special case to avoid picklability checks in delayed
-> 1083         return func(X, Y, **kwds)
   1084 
   1085     # TODO: in some cases, backend='threading' may be appropriate

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    243         YY = row_norms(Y, squared=True)[np.newaxis, :]
    244 
--> 245     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    246     distances *= -2
    247     distances += XX

/usr/local/lib/python3.4/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    184         ret = a * b
    185         if dense_output and hasattr(ret, "toarray"):
--> 186             ret = ret.toarray()
    187         return ret
    188     else:

/usr/local/lib/python3.4/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
    918     def toarray(self, order=None, out=None):
    919         """See the docstring for `spmatrix.toarray`."""
--> 920         return self.tocoo(copy=False).toarray(order=order, out=out)
    921 
    922     ##############################################################

/usr/local/lib/python3.4/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
    256         M,N = self.shape
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
    260 

ValueError: could not convert integer scalar

I'm using Python 3.4.3 (on Red Hat), scipy 0.18.1, and scikit-learn 0.18.1.

I tried the monkey patch suggested here, but it didn't work.

Googling around, I found a bug fix that apparently solved the same problem for other sparse matrix types (like csr), but not for coo.

I've tried feeding DBSCAN a sparse radius neighborhood graph (instead of a feature matrix), as suggested here, but I get the same error.
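That attempt looked roughly like the sketch below (`eps` is a placeholder value):

```python
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

eps = 0.5  # placeholder; the actual radius would need tuning

# sparse graph of pairwise distances within eps, computed up front
graph = radius_neighbors_graph(tfidf[idxs], radius=eps, mode='distance')

clusterizer = DBSCAN(eps=eps, metric='precomputed')
labels = clusterizer.fit_predict(graph)  # still ends in the same ValueError
```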

I've tried HDBSCAN, but the same error happens.

How can I fix or work around this?

  • What is `idxs` in `fit_predict(tfidf[idxs])`? Are you using only some values from tfidf? – Vivek Kumar Mar 02 '17 at 01:43
  • `idxs` is a list of indices (yes, I'm using only some values from tfidf - it has a total of ~163k documents, but I'm using only ~107k of them) – Parzival Mar 02 '17 at 13:11
  • Have you tried updating scipy and scikit version? – Vivek Kumar Mar 02 '17 at 13:24
  • They are both up-to-date (v0.18.1). – Parzival Mar 02 '17 at 13:41
  • What is `tfidf`? Print `type(tfidf)` and `tfidf.shape`. – sergzach Mar 03 '17 at 11:42
  • `type(tfidf)`: `scipy.sparse.csr.csr_matrix` `tfidf.shape`: `(163405, 13029)` – Parzival Mar 03 '17 at 13:19
  • @Parzival did you try to place a breakpoint at the position where the error occurs? Then have a look at the parameters that are passed to the function where the error occurs. They should be strange. If they do not appear strange to you, please post them here. – yar Mar 06 '17 at 19:13
  • Ok, took me forever to zero in on it but the cutoff is 65805 - that's the maximum number of documents I can cluster without getting that error message. The parameter passed to the `toarray()` method is a `scipy.sparse.coo.coo_matrix` with as many rows as documents; it doesn't look weird at all. – Parzival Mar 07 '17 at 17:43
  • Wait, no, the problematic method is actually a bit further downstream - `coo_todense`. It receives as parameters `M` and `N`, where `M` is the number of documents and `N` is `M` divided by some number that keeps changing (1, 10, or 100); `self.row`, `self.col`, and `self.data`, which together form a distance matrix; `B.ravel('A')`, which I have no idea what it is (it's a vector of sorts, but I don't know what it represents); and `fortran`, which is an `int` that's always 0 in my case. None of these arguments look weird, whether I have 65805 documents or more. – Parzival Mar 07 '17 at 18:02
  • Could you print the output of: import numpy as np / print(np.intp) – Robin Mar 13 '17 at 11:09
  • For efficiency reasons, you don't want it to have to convert to a dense matrix, nor to compute a pairwise distance matrix (which would be 163405 x 163405). – Erich Schubert Mar 17 '17 at 13:54

1 Answer

Even if the implementation allowed it, DBSCAN would probably yield poor results on such high-dimensional data (from a statistical point of view, because of the curse of dimensionality).

Instead, I would advise you to use the TruncatedSVD class to reduce the dimensionality of your TF-IDF feature vectors to 50 or 100 components, and then to apply DBSCAN on the results.
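A minimal sketch of that approach (the number of components and the DBSCAN parameters below are placeholders you would have to tune for your data):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import DBSCAN

# reduce the 13029-dimensional TF-IDF vectors to ~100 dense components (LSA)
svd = TruncatedSVD(n_components=100)
lsa = make_pipeline(svd, Normalizer(copy=False))  # re-normalize rows after the SVD
reduced = lsa.fit_transform(tfidf[idxs])

# eps and min_samples are placeholders; tune them for the reduced space
clusterizer = DBSCAN(eps=0.5, min_samples=5)
labels = clusterizer.fit_predict(reduced)
```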

– ogrisel