
First, thanks for reading and taking the time to respond.

Second, the question:

I have a PxN matrix X where P is on the order of 10^6 and N is on the order of 10^3, so X is relatively large and not sparse. Say each row of X is an N-dimensional sample. I want to construct a PxP matrix of pairwise distances between these P samples. Let's also say I am interested in Hellinger distances.
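(For reference, the Hellinger distance between two discrete distributions, or non-negative vectors, *p* and *q* is *H(p, q) = ||sqrt(p) - sqrt(q)||_2 / sqrt(2)*, i.e. the Euclidean distance between the element-wise square roots, scaled by *1/sqrt(2)*.)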

So far I am relying on sparse dok matrices:

import math
import numpy as np
import scipy.sparse as sp

def hellinger_distance(X):
    P = X.shape[0]
    H1 = sp.dok_matrix((P, P))
    for i in range(P):
        if i % 100 == 0:
            print(i)  # progress indicator
        x1 = X[i]
        X2 = X[i:P]
        # Euclidean distance between element-wise square roots, scaled by 1/sqrt(2)
        h = np.sqrt(((np.sqrt(x1) - np.sqrt(X2)) ** 2).sum(1)) / math.sqrt(2)
        H1[i, i:P] = h
    H = H1 + H1.T  # mirror the upper triangle (the diagonal is zero)
    return H

This is super slow. Is there a more efficient way of doing this? Any help is much appreciated.

JRun
    Do you *really* need to compute every pairwise distance? Your *PxP* matrix would contain 1E12 elements, so assuming you are dealing with double floats, you're looking at about 8000 GB just to store the output. If you just computed the upper triangle, that would be *(P^2 - P) / 2 ~= 5E11* elements, which would be about 4000 GB assuming double floats. Even a binary matrix of that size would be about 500 GB in size. – ali_m Oct 07 '15 at 18:23
  • You are correct. I learnt that with dense matrices in scipy you can calculate the condensed form (upper triangle only) directly, but not with sparse matrices. – JRun Aug 16 '16 at 22:09
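As the last comment notes, scipy's pdist returns the condensed (upper-triangle) form directly, which halves the storage for the output. A minimal sketch, assuming X is the data matrix from the question:

import numpy as np
from scipy.spatial.distance import pdist

# 1-D condensed vector holding the (P^2 - P) / 2 upper-triangle entries,
# without ever materializing the full P x P array
d_condensed = pdist(np.sqrt(X)) / np.sqrt(2)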

2 Answers


You can use pdist and squareform from scipy.spatial.distance -

from scipy.spatial.distance import pdist, squareform

out = squareform(pdist(np.sqrt(X))) / np.sqrt(2)

Or use cdist from the same module -

from scipy.spatial.distance import cdist

sX = np.sqrt(X)
out = cdist(sX,sX)/np.sqrt(2)
Divakar
  • Thanks for the quick and clean response. I am still running this and the computation has not finished yet. I will update as I get the results. – JRun Oct 07 '15 at 18:19
  • @JRun Awesome! Let me know how that goes! And the performance boost (hoping there is some!). – Divakar Oct 07 '15 at 18:24

In addition to Divakar's response, I realized that sklearn has an implementation of this that allows parallel processing:

import math
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

njobs = 3
H = pairwise_distances(np.sqrt(X), n_jobs=njobs, metric='euclidean') / math.sqrt(2)

I will do some benchmarking and post the results later.
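
In the meantime, a minimal timing sketch on a small random stand-in for X (the sizes and n_jobs value are illustrative) that compares the two approaches and checks they agree:

import time
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import pairwise_distances

Xs = np.random.rand(2000, 1000)  # small subset standing in for X

t0 = time.perf_counter()
d1 = squareform(pdist(np.sqrt(Xs))) / np.sqrt(2)
t1 = time.perf_counter()
d2 = pairwise_distances(np.sqrt(Xs), n_jobs=3, metric='euclidean') / np.sqrt(2)
t2 = time.perf_counter()

print('pdist: %.2fs, pairwise_distances: %.2fs' % (t1 - t0, t2 - t1))
print(np.allclose(d1, d2))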

JRun