
I have a large (100K by 30K), very sparse dataset in svmlight format, which I load as follows:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file

X,Y = load_svmlight_file("somefile_svm.txt")

which returns X as a sparse scipy (CSR) matrix, along with the label array Y.

I simply need to compute the pairwise distances of all training points as

D = pdist(X)

Unfortunately, the distance computation implementations in scipy.spatial.distance work only for dense matrices. Due to the size of the dataset, it is infeasible to densify the matrix and, say, use pdist as

D = pdist(X.todense())

Any pointers to sparse-matrix distance computation implementations, or workarounds for this problem, would be greatly appreciated.

Many thanks

Nicholas

1 Answer


In scikit-learn there is a sklearn.metrics.euclidean_distances function that works for both sparse matrices and dense numpy arrays. See the reference documentation.
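
A minimal sketch of how this might look (the block size of 1000 rows is an arbitrary choice for illustration; as the comments below point out, the result is a dense array, so computing it for all 100K rows at once will not fit in memory):

from sklearn.datasets import load_svmlight_file
from sklearn.metrics.pairwise import euclidean_distances

X, Y = load_svmlight_file("somefile_svm.txt")  # X stays sparse (CSR)

# euclidean_distances accepts sparse input directly, but returns a dense
# (n_rows, n_rows) ndarray, so restrict it to a manageable block of rows
block = X[:1000]
D = euclidean_distances(block)  # dense (1000, 1000) array of pairwise distances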

However, non-euclidean distances are not yet implemented for sparse matrices.

ogrisel
  • Thank you for your answer. At first it seemed like a solution to my problem, since `euclidean_distances` works with sparse data; however, even with `D = euclidean_distances(X, X)` I get an out-of-memory error. – Nicholas Jan 22 '12 at 16:10
  • @Nicholas: `euclidean_distances` necessarily returns a dense `X.shape[0]` × `X.shape[0]` array, which has on the order of 1e10 entries in your case. – Fred Foo Jan 22 '12 at 17:21
  • 1
    @Nicholas if you want to implement k-means on a large dataset (in the direction `X.shape[0]`), you should try the `sklearn.cluster.MiniBatchKMeans` class). It processes the input set incrementally by small chunks hence the memory usage is controlled. – ogrisel Jan 22 '12 at 18:31
  • Actually it's not k-means that I want to implement in Python (there exist many efficient sparse implementations in C), but rather measures that evaluate the quality of a clustering result. To this end, the simplicity of writing a Python script would come in handy, but it seems that it cannot handle the memory requirements for my problem. Many thanks for all the answers! – Nicholas Jan 23 '12 at 08:50
  • 1
    If you data is very sparse and your rows are positive and normalized you can compute the sparse matrix of the dot product `A * A`. If the data is sparse enough that will fit in memory. The euclidean distances can then be implicitly defined element wise by `2 - 2 * A * A`. – ogrisel Jan 25 '12 at 22:30
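
A rough sketch of the trick from the last comment, assuming the rows are first L2-normalized (the indices i and j are arbitrary, for illustration only):

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import normalize

X, Y = load_svmlight_file("somefile_svm.txt")

A = normalize(X, norm="l2")  # unit-norm rows, required for the identity below
G = A.dot(A.T)               # sparse Gram matrix; fits in memory only if sparse enough

# for unit-norm rows a and b: ||a - b||^2 = 2 - 2 * a.dot(b)
i, j = 0, 1
d = np.sqrt(max(2.0 - 2.0 * G[i, j], 0.0))  # clamp tiny negative round-off to zero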
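
And a minimal usage sketch for the `MiniBatchKMeans` suggestion from the comments (the values of n_clusters and batch_size are arbitrary placeholders, not tuned settings):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_svmlight_file

X, Y = load_svmlight_file("somefile_svm.txt")

# fit accepts sparse input and processes it in small batches,
# so memory usage stays bounded even for 100K rows
km = MiniBatchKMeans(n_clusters=50, batch_size=1000)
km.fit(X)
labels = km.labels_  # cluster assignment for each row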