I am trying to use the spectral clustering method provided by scikit-learn to cluster the rows of my dataset (there are only 16000 of them). My problem starts after I precompute the affinity matrix, a 16000x16000 float matrix that allocates roughly 3 GB (I can reach 8 GB at most): the method called on that matrix with the ARPACK solver needs much more memory than that, and the CPU loses so much time swapping things in and out of memory that the computation is slowed to a crawl. I have also tried calling the garbage collector before the method, but without success:
import gc
gc.collect()
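For clarity, the call I am making looks roughly like this (a minimal sketch; n_clusters is illustrative, and similarity_matrix is the precomputed affinity described at the end of this question):

from sklearn.cluster import SpectralClustering

# sketch of the call: precomputed 16000x16000 affinity, ARPACK eigensolver
sc = SpectralClustering(n_clusters=8,
                        eigen_solver='arpack',
                        affinity='precomputed')
labels = sc.fit_predict(similarity_matrix)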
How can I get a precise picture of what is going on? Is this a known issue? Are there any alternatives in Python to perform spectral clustering out of the box?
I can post a minimal working example if needed.
UPDATE: I was wrong about the lobpcg solver: at first it seems to use all my memory, but then it stabilizes around 5 GB and the process continues. However, another issue arises. The computation seems to lead to numerical inaccuracies that in the end generate NaNs or an error like this one (the error that comes up changes as the affinity matrix changes slightly, see below):
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/spectral.py", line 255, in spectral_clustering
eigen_tol=eigen_tol, drop_first=False)
File "/usr/local/lib/python3.4/dist-packages/sklearn/manifold/spectral_embedding_.py", line 303, in spectral_embedding
largest=False, maxiter=2000)
File "/usr/local/lib/python3.4/dist-packages/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py", line 449, in lobpcg
assert np.allclose(gramA.T, gramA)
AssertionError
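The assertion that fails is a symmetry check on an internal Gram matrix. As a sanity check, the input affinity itself can be tested and, if needed, forced to be exactly symmetric (a sketch only; I am not claiming this resolves the error):

import numpy as np

# is the precomputed affinity numerically symmetric?
print(np.allclose(similarity_matrix, similarity_matrix.T))

# force exact symmetry, to rule out floating-point drift as the culprit
similarity_matrix = (similarity_matrix + similarity_matrix.T) / 2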
My affinity metric is a Gaussian kernel, exp(-||x_1 - x_2||^2 / (2 * sigma^2)), and the results vary a lot when I try different values for sigma. I can understand that inaccuracies can arise when some entries are close to zero, but that is not the case when I use large values for sigma (e.g. 5.0), which push the entries closer to 1.0 instead.
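To illustrate the effect of sigma on the entries (the squared distances here are made up, not from my data):

import numpy as np

sq_dists = np.array([0.5, 2.0, 10.0])   # example squared distances
for sigma in (0.5, 5.0):
    print(sigma, np.exp(-sq_dists / (2 * sigma ** 2)))
# sigma = 0.5 -> entries collapse towards 0; sigma = 5.0 -> entries stay near 1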
I now suspect that I am missing something or doing something wrong in the process, so I am updating this question with a new aim.
For reference, this is how I compute the affinity matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# condensed pairwise squared Euclidean distances, expanded to a square matrix
pairwise_dists = squareform(pdist(data_slice, 'sqeuclidean'))
# Gaussian kernel; np.exp instead of scipy.exp (which just re-exports NumPy's)
similarity_matrix = np.exp(-pairwise_dists / (2 * self._sigma ** 2))
where data_slice is a numpy array whose rows are my instances and self._sigma stores the Gaussian sigma parameter.
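For scale, a rough memory accounting of this step (assuming float64 throughout):

n = 16000
gib = 2.0 ** 30
print(n * n * 8 / gib)             # one square n x n float64 matrix: ~1.9 GiB
print(n * (n - 1) // 2 * 8 / gib)  # pdist's condensed output: ~0.95 GiB
# pairwise_dists and similarity_matrix are one square matrix each, and
# np.exp allocates further temporaries of the same size while evaluating.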