
I have a large (100K by 30K), very sparse dataset in svmlight format, which I load as follows:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file

X,Y = load_svmlight_file("somefile_svm.txt")

which returns X as a sparse scipy (CSR) matrix, along with the label array Y.

I simply need to compute the pairwise distances of all training points as

D = pdist(X)

Unfortunately, the distance computation implementations in scipy.spatial.distance work only for dense matrices. Due to the size of the dataset, it is infeasible to densify the matrix and, say, use pdist as

D = pdist(X.todense())

Any pointers to sparse-matrix distance computation implementations, or workarounds for this problem, would be greatly appreciated.

Many thanks

Nicholas

1 Answer


In scikit-learn there is a sklearn.metrics.euclidean_distances function that works for both sparse matrices and dense numpy arrays. See the reference documentation.
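
A minimal sketch of how this might look (the block size of 1000 rows is an arbitrary choice for illustration; as the comments below point out, the result is a dense array, so computing it for all 100K rows at once will not fit in memory):

from sklearn.datasets import load_svmlight_file
from sklearn.metrics.pairwise import euclidean_distances

X, Y = load_svmlight_file("somefile_svm.txt")  # X stays sparse (CSR)

# euclidean_distances accepts sparse input directly, but returns a dense
# (n_rows, n_rows) ndarray, so restrict it to a manageable block of rows
block = X[:1000]
D = euclidean_distances(block)  # dense (1000, 1000) array of pairwise distances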

However, non-euclidean distances are not yet implemented for sparse matrices.

ogrisel
  • Thank you for your answer. At first it seemed like a solution to my problem, since `euclidean_distances` works with sparse data; however, even with `D = euclidean_distances(X, X)` I get an out-of-memory error. – Nicholas Jan 22 '12 at 16:10
  • @Nicholas: `euclidean_distances` necessarily returns a dense `X.shape[0]` × `X.shape[0]` array, which has on the order of 1e10 entries in your case. – Fred Foo Jan 22 '12 at 17:21
  • 1
    @Nicholas if you want to implement k-means on a large dataset (in the direction `X.shape[0]`), you should try the `sklearn.cluster.MiniBatchKMeans` class). It processes the input set incrementally by small chunks hence the memory usage is controlled. – ogrisel Jan 22 '12 at 18:31
  • Actually it's not k-means that I want to implement in Python (there exist many efficient sparse implementations in C), but rather measures that evaluate the quality of a clustering result. To this end, the simplicity of writing a Python script would come in handy, but it seems that it cannot handle the memory requirements for my problem. Many thanks for all the answers! – Nicholas Jan 23 '12 at 08:50
  • 1
    If you data is very sparse and your rows are positive and normalized you can compute the sparse matrix of the dot product `A * A`. If the data is sparse enough that will fit in memory. The euclidean distances can then be implicitly defined element wise by `2 - 2 * A * A`. – ogrisel Jan 25 '12 at 22:30
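
A rough sketch of the trick from the last comment, assuming the rows are first L2-normalized (the indices i and j are arbitrary, for illustration only):

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import normalize

X, Y = load_svmlight_file("somefile_svm.txt")

A = normalize(X, norm="l2")  # unit-norm rows, required for the identity below
G = A.dot(A.T)               # sparse Gram matrix; fits in memory only if sparse enough

# for unit-norm rows a and b: ||a - b||^2 = 2 - 2 * a.dot(b)
i, j = 0, 1
d = np.sqrt(max(2.0 - 2.0 * G[i, j], 0.0))  # clamp tiny negative round-off to zero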
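
And a minimal usage sketch for the `MiniBatchKMeans` suggestion from the comments (the values of n_clusters and batch_size are arbitrary placeholders, not tuned settings):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_svmlight_file

X, Y = load_svmlight_file("somefile_svm.txt")

# fit accepts sparse input and processes it in small batches,
# so memory usage stays bounded even for 100K rows
km = MiniBatchKMeans(n_clusters=50, batch_size=1000)
km.fit(X)
labels = km.labels_  # cluster assignment for each row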