
I am trying to use the spectral clustering method provided by scikit-learn to cluster the rows of my dataset (which has only 16000 rows). My issue arises after I precompute the affinity matrix (a 16000x16000 float matrix), which allocates roughly 3 GB (I can reach 8 GB at most): the method called on that matrix with the arpack solver requires much more memory, and the CPU spends so much time swapping pages in and out that the computation slows to a crawl. I have also tried calling the garbage collector before the method, without success:

import gc
gc.collect()

How can I get a precise picture of what is going on? Is this a known issue? Are there any alternatives in Python that perform spectral clustering out of the box?

I can post a minimal working example if needed.
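
In the meantime, this is roughly how the method is invoked (a sketch with illustrative names and an arbitrary n_clusters, not my exact script):

from sklearn.cluster import SpectralClustering

# similarity_matrix is the precomputed 16000x16000 affinity matrix.
sc = SpectralClustering(n_clusters=8, affinity='precomputed',
                        eigen_solver='arpack')
labels = sc.fit_predict(similarity_matrix)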

UPDATE: I was wrong about the lobpcg solver. At first it seems to use all my memory, but then it stabilizes around 5 GB and the process continues; however, another issue arises. The computation seems to run into numerical inaccuracies that eventually produce NaNs or an error like the one below (the exact error changes as the affinity matrix changes slightly, see below):

File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/spectral.py", line 255, in spectral_clustering
eigen_tol=eigen_tol, drop_first=False)
File "/usr/local/lib/python3.4/dist-packages/sklearn/manifold/spectral_embedding_.py", line 303, in spectral_embedding
largest=False, maxiter=2000)
File "/usr/local/lib/python3.4/dist-packages/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py", line 449, in lobpcg
assert np.allclose(gramA.T, gramA)
AssertionError

My affinity metric is a Gaussian kernel, exp(-||x_1 - x_2||^2 / (2 * sigma^2)), and the results vary a lot for different values of sigma. I can understand that some inaccuracies may arise when many entries are close to zero, but that is not the case when I use a large sigma (= 5.0), which pushes the entries closer to 1.0 instead.

Now I suspect that I am missing something or doing something wrong in the process, so I am updating this question with a new focus.

For reference, this is how I compute the affinity matrix:

import scipy.spatial.distance

# Pairwise squared Euclidean distances between all rows of data_slice.
pairwise_dists = scipy.spatial.distance.squareform(
    scipy.spatial.distance.pdist(data_slice, 'sqeuclidean'))
# Gaussian kernel applied entrywise to the distance matrix.
similarity_matrix = scipy.exp(-pairwise_dists / (2 * self._sigma ** 2))

where data_slice is a numpy array whose rows are my instances and self._sigma stores the Gaussian sigma parameter.
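
To rule out obvious problems on my side, a quick sanity check on the matrix looks roughly like this (a sketch, assuming similarity_matrix is the array computed above):

import numpy as np

# The affinity matrix should be finite, symmetric and bounded in (0, 1].
assert np.isfinite(similarity_matrix).all()
assert np.allclose(similarity_matrix, similarity_matrix.T)
print(similarity_matrix.min(), similarity_matrix.max())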

rano
  • have you tried with a reduced subset, let's say 10x10, 100x100, 1Kx1K? Spectral clustering is based on the calculation of eigenvectors, which is a heavy operation. Going from a small dataset to a larger one can help you figure out the expected time for the original dataset. – Picarus Sep 19 '14 at 08:37
  • I've tried with small synthetic datasets (200-400 rows) but not in a progressive way. It is true that I could determine a viable size that way, but it would not give me a way to use spectral clustering on the original dataset (nor tell me precisely what is going wrong). – rano Sep 19 '14 at 09:08
  • Can you say which operation is swapping? It might be possible to use a different eigensolver or to use float32. Why do you say "too much memory" in the title? You are using less than three times the memory needed to store the matrix in order to do an eigendecomposition; that doesn't seem horribly wrong. How many clusters are you trying to get? If there are few clusters, only a few eigenvectors should be needed. – Andreas Mueller Sep 19 '14 at 16:50
  • Btw, have you tried other clustering or dimensionality reduction methods? – Andreas Mueller Sep 19 '14 at 17:00
  • @AndreasMueller how is dimensionality reduction to help here? – Has QUIT--Anony-Mousse Sep 20 '14 at 18:42
  • Well not for applying spectral clustering but for getting insights into the dataset or applying other clustering methods in the primal. – Andreas Mueller Sep 21 '14 at 12:45

1 Answer


Spectral clustering computes eigenvectors of (the graph Laplacian of) the affinity matrix.

This matrix has size O(n^2), and thus pretty much any implementation will need O(n^2) memory.

16000 x 16000 x 4 bytes (assuming float32 storage and no overhead) is about 1 GB. The computation probably needs a working copy (methods such as scipy.exp will likely produce a copy of your matrix, possibly in double precision), plus some overhead, which is why you end up using 3 GB...
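
A quick back-of-the-envelope check (a sketch; note that NumPy defaults to float64, which doubles the figure):

n = 16000
print(n * n * 4 / 1024**3)  # float32: ~0.95 GB
print(n * n * 8 / 1024**3)  # float64: ~1.91 GB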

This algorithm simply is not applicable to large data, just like any other algorithm requiring O(n^2) memory. Choose a different algorithm, maybe one that can be accelerated using index structures; or reduce your data set size, for example by sampling.
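
A minimal sketch of the sampling idea (the array contents, shapes and sample size below are arbitrary stand-ins):

import numpy as np

rng = np.random.RandomState(0)
data_slice = rng.rand(16000, 32)  # stand-in for the real instances

# Cluster a random subsample instead of all 16000 rows; the remaining rows
# can afterwards be assigned to the cluster of their nearest sampled point.
idx = rng.choice(data_slice.shape[0], size=4000, replace=False)
sample = data_slice[idx]  # build the (much smaller) affinity matrix on this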

Has QUIT--Anony-Mousse
  • As I understood the question, it was "why doesn't it work with 8gb ram if the dissimilarity matrix is 3gb", but maybe I misunderstood. – Andreas Mueller Sep 21 '14 at 12:46
  • I updated the question. Changing the eigensolver mitigates the memory issue but introduces numerical errors, I am wondering whether I am doing something wrong. – rano Sep 22 '14 at 07:26
  • To save memory, one could (in principle) compute the action of the matrix against vectors on-the-fly, and use that within iterative methods for calculating eigenvectors. Then one would only need O(n) memory, but take much greater time. – Nick Alger Nov 22 '15 at 00:03
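
A minimal sketch of that matrix-free idea, using scipy's LinearOperator with a Gaussian kernel evaluated in row chunks (the data, sigma and chunk size are arbitrary stand-ins; spectral clustering proper would apply the same trick to the normalized graph Laplacian rather than to the raw affinity matrix):

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.RandomState(0)
X = rng.rand(16000, 32).astype(np.float32)  # stand-in for the real data
sigma = 5.0
n = X.shape[0]
sq_norms = (X ** 2).sum(axis=1)

def affinity_matvec(v):
    # Compute A @ v without materializing the full n x n kernel matrix,
    # evaluating the Gaussian affinities one chunk of rows at a time.
    v = np.asarray(v).ravel()
    out = np.zeros(n)
    for start in range(0, n, 1000):
        stop = min(start + 1000, n)
        d2 = (sq_norms[start:stop, None] + sq_norms[None, :]
              - 2.0 * X[start:stop] @ X.T)
        # Clip tiny negative values caused by floating-point cancellation.
        out[start:stop] = np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2)) @ v
    return out

A = LinearOperator((n, n), matvec=affinity_matvec, dtype=np.float64)
vals, vecs = eigsh(A, k=8)  # iterative Lanczos solver, needs only matvecs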