
I have a NumPy array in Python which contains lots (10k+) of 3D vertex points (vectors with coordinates [x, y, z]). I need to calculate the distance between all possible pairs of these points.

It's easy to do using SciPy:

import scipy.spatial.distance as spdist
D = spdist.cdist(verts, verts)

But I can't use this because of project policy on introducing new dependencies.

So I came up with this naive code:

def vert_dist(self, A, B):
    return ((B[0]-A[0])**2 + (B[1]-A[1])**2 + (B[2]-A[2])**2)**(1.0/2)

# Pairwise distance between verts:
# use SciPy if available, otherwise fall back to the naive loops
try:
    import scipy.spatial.distance as spdist
    D = spdist.cdist(verts, verts)
except ImportError:
    # FIXME: This is VERY SLOW:
    D = np.empty((len(verts), len(verts)), dtype=np.float64)
    for i, v in enumerate(verts):
        #self.app.setStatus(_("Calculating distance %d of %d (SciPy not installed => using SLOW AF fallback method)"%(i,len(verts))), True)
        for j in range(i, len(verts)):
            # Distances are symmetric, so fill both halves at once
            D[j][i] = D[i][j] = self.vert_dist(v, verts[j])

vert_dist() calculates the 3D distance between two vertices, and the rest of the code just iterates over the vertices in the 1D array; for each one it calculates the distance to every other vertex in the same array, producing a 2D array of distances.

But that's extremely slow (about 1000× compared to SciPy's native C code). I wonder if I can speed it up using pure NumPy, at least to some extent.

Some more info: https://github.com/scipy/scipy/issues/9172

BTW, I've tried the PyPy JIT compiler and it was even slower (10×) than pure Python.

UPDATE: I was able to speed things up a little, like this:

    def vert_dist_matrix(self, verts):
        # FIXME: Still noticeably slower than SciPy:
        D = np.empty((len(verts), len(verts)), dtype=np.float64)
        for i, v in enumerate(verts):
            # Compute a whole row (and its mirrored column) at once
            D[i] = D[:, i] = np.sqrt(np.sum(np.square(verts - v), axis=1))
        return D

This eliminates the inner loop by computing a whole row at once, which makes things quite a bit faster, but it's still noticeably slower than SciPy. So I'm still looking at @Divakar's solution.
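For reference, the whole matrix can also be computed as a single broadcast expression (a sketch, not from the original post; note that it materializes the full (N, N, 3) difference array, roughly 2.4 GB for 10k float64 points, so the row-at-a-time loop above is friendlier on memory):

import numpy as np

def vert_dist_broadcast(verts):
    # (N, 1, 3) - (1, N, 3) broadcasts to an (N, N, 3) difference array
    diff = verts[:, None, :] - verts[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))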

Harvie.CZ
  • Are you looking for something like this? https://stackoverflow.com/a/51352819/4909087 – cs95 Aug 26 '18 at 21:51
  • this answer is also quite interesting as the performance is discussed: https://stackoverflow.com/a/37903795/8069403 – xdze2 Aug 26 '18 at 21:57
  • @coldspeed I already saw this post, but it was very unclear to me how I can modify it to do the `((B[0]-A[0])**2+(B[1]-A[1])**2+(B[2]-A[2])**2)**(1.0/2)` calculation rather than just subtraction. – Harvie.CZ Aug 26 '18 at 21:58
  • @xdze2 I've tried this, but it only replaces the `vert_dist()` function, which is fast enough. When I replace `((B[0]-A[0])**2+(B[1]-A[1])**2+(B[2]-A[2])**2)**(1.0/2)` with `np.sqrt(np.sum(np.square(B-A)))` it's a bit more readable, but it still does not eliminate the 2D for loops, which are the actual slow part of this code. – Harvie.CZ Aug 26 '18 at 22:06

1 Answer


There's the eucl_dist package (disclaimer: I am its author) that contains two methods to compute squared Euclidean distances more efficiently than SciPy's cdist, especially for large arrays (with a decent to large number of columns).

We will adapt some code from its source to our problem here, giving us two approaches.

Approach #1

Following the wiki contents, we could leverage matrix multiplication and some NumPy-specific implementations for our first approach. The key identity is ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b, which lets us get all pairwise squared distances from the row-wise squared norms and one matrix product, like so -

def pdist_squareformed_numpy(a):
    # ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 * a_i . a_j
    a_sumrows = np.einsum('ij,ij->i', a, a)  # row-wise squared norms
    dist = a_sumrows[:, None] + a_sumrows - 2*np.dot(a, a.T)
    np.fill_diagonal(dist, 0)  # diagonal is exactly zero; clear any float noise
    return dist

Approach #2

One more method would be to create "extended" versions of the input arrays, again discussed in detail at that GitHub source-code link. This second approach works better for a small number of columns, as is the case here, like so -

def ext_arrs(A, B, precision="float64"):
    # Pack A_ext = [1 | A | A**2] and B_ext = [B**2 ; -2*B.T ; 1] so that
    # A_ext.dot(B_ext) expands to ||a||^2 - 2*a.b + ||b||^2 in one product
    nA, dim = A.shape
    A_ext = np.ones((nA, dim*3), dtype=precision)
    A_ext[:, dim:2*dim] = A
    A_ext[:, 2*dim:] = A**2

    nB = B.shape[0]
    B_ext = np.ones((dim*3, nB), dtype=precision)
    B_ext[:dim] = (B**2).T
    B_ext[dim:2*dim] = -2.0*B.T
    return A_ext, B_ext

def pdist_squareformed_numpy_v2(a):
    A_ext, B_ext = ext_arrs(a, a)
    dist = A_ext.dot(B_ext)  # a single BLAS matrix product does all the work
    np.fill_diagonal(dist, 0)
    return dist
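
As a quick sanity check (my addition, reusing ext_arrs() from above), the extended-array product matches a direct squared-distance computation on a tiny input:

import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(4, 3)

A_ext, B_ext = ext_arrs(A, A)
gemm_sq = A_ext.dot(B_ext)                                # squared distances via one matrix product
direct_sq = ((A[:, None, :] - A[None, :, :])**2).sum(-1)  # squared distances via broadcasting

print(np.allclose(gemm_sq, direct_sq))  # True (up to floating-point noise)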

Note that these give us squared Euclidean distances. So, for the actual distances, we want to use np.sqrt() if that's the final output needed.
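
One caveat (my note, prompted by the comment thread below): floating-point cancellation in the matrix product can leave tiny negative entries in the squared distances, which np.sqrt() turns into NaN. Clamping at zero before the square root is a cheap guard:

def pdist_eucl_numpy(a):
    # Clamp tiny negative values caused by cancellation before the sqrt
    return np.sqrt(np.maximum(pdist_squareformed_numpy(a), 0))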

Sample runs -

In [380]: np.random.seed(0)
     ...: a = np.random.rand(5,3)

In [381]: from scipy.spatial.distance import cdist

In [382]: cdist(a,a)
Out[382]: 
array([[0.  , 0.29, 0.42, 0.2 , 0.57],
       [0.29, 0.  , 0.58, 0.42, 0.76],
       [0.42, 0.58, 0.  , 0.45, 0.9 ],
       [0.2 , 0.42, 0.45, 0.  , 0.51],
       [0.57, 0.76, 0.9 , 0.51, 0.  ]])

In [383]: np.sqrt(pdist_squareformed_numpy(a))
Out[383]: 
array([[0.  , 0.29, 0.42, 0.2 , 0.57],
       [0.29, 0.  , 0.58, 0.42, 0.76],
       [0.42, 0.58, 0.  , 0.45, 0.9 ],
       [0.2 , 0.42, 0.45, 0.  , 0.51],
       [0.57, 0.76, 0.9 , 0.51, 0.  ]])

In [384]: np.sqrt(pdist_squareformed_numpy_v2(a))
Out[384]: 
array([[0.  , 0.29, 0.42, 0.2 , 0.57],
       [0.29, 0.  , 0.58, 0.42, 0.76],
       [0.42, 0.58, 0.  , 0.45, 0.9 ],
       [0.2 , 0.42, 0.45, 0.  , 0.51],
       [0.57, 0.76, 0.9 , 0.51, 0.  ]])

Timings on 10k points -

In [385]: a = np.random.rand(10000,3)

In [386]: %timeit cdist(a,a)
1 loop, best of 3: 309 ms per loop

# Approach #1
In [388]: %timeit pdist_squareformed_numpy(a) # squared eucl distances
1 loop, best of 3: 668 ms per loop

In [389]: %timeit np.sqrt(pdist_squareformed_numpy(a)) # actual eucl distances
1 loop, best of 3: 812 ms per loop

# Approach #2
In [390]: %timeit pdist_squareformed_numpy_v2(a) # squared eucl distances
1 loop, best of 3: 237 ms per loop

In [391]: %timeit np.sqrt(pdist_squareformed_numpy_v2(a)) # actual eucl distances
1 loop, best of 3: 395 ms per loop

The second approach seems close to cdist in performance!

Divakar
  • @Divakar This is really cool and blazing fast, thanks! But in `pdist_squareformed_numpy_v2()` I had to replace `return dist` with `return np.abs(np.nan_to_num(dist))`. It works better now, but it gives inaccurate results in some cases. This code is made so that vertices closer than 1e-5 can be treated as identical, but I have to lower the precision to 1e-1 to get the same results as with SciPy. Probably there's some loss of accuracy. – Harvie.CZ Aug 26 '18 at 22:57
  • @Harvie.CZ Why are you using `np.nan_to_num` ? Would you have `NaNs` in your input array? – Divakar Aug 26 '18 at 22:58
  • @Divakar I've added it before `np.abs()`; just found that with `np.abs()` it's not needed anymore. Probably `np.sqrt()` was returning complex numbers for negative values, which the rest of the code was not able to interpret (e.g. when comparing using <). – Harvie.CZ Aug 26 '18 at 23:01
  • @Harvie.CZ Yeah simply `np.abs(pdist_squareformed_numpy_v2(a))` should suffice I would think. There could be those precision issues with these matrix-mult based methods. – Divakar Aug 26 '18 at 23:05
  • @Divakar I've filtered for values that differ by more than 1e-2 from the SciPy output. There are hundreds of cases where `pdist_squareformed_numpy_v2()` returns values around 0.01 (e.g. 0.01512599308381024 or 0.010880743067281888, etc.) but SciPy returns 0.00. If I lower the threshold to 1e-5 it gets even more error cases. Is there a way to increase matrix-mult precision? – Harvie.CZ Aug 26 '18 at 23:18
  • @Harvie.CZ Nah, it's a matrix-mult thing, maybe exaggerated by the extended versions of approach #2. How about the first approach: `pdist_squareformed_numpy(a)`? – Divakar Aug 26 '18 at 23:21
  • @Divakar I am testing on small dataset of 3072 vertices. v2 has 3130 errors, while v1 only has 748 errors at 1e-2 level. – Harvie.CZ Aug 26 '18 at 23:27
  • @Harvie.CZ Check out the `For high-precision results` added at the end of the post. – Divakar Aug 26 '18 at 23:46
  • @Divakar Sorry, it seems that I was using input data in float32 instead of float64 format. Thanks for your help. This is so cool and useful!!! I was getting the data like that from another library. Now I convert them like this before processing: `verts = np.array(verts, dtype=np.float64)` – Harvie.CZ Aug 26 '18 at 23:58
  • @Harvie.CZ Hmm, so what works then? Approach #1 and #2? – Divakar Aug 26 '18 at 23:59
  • @Divakar Both are working! I'll do some more benchmarking and probably go with the first one, since it's shorter and easier to understand. – Harvie.CZ Aug 27 '18 at 00:01
  • I'll use `pdist_squareformed_numpy()`; it's only 2 times slower than SciPy. `pdist_squareformed_numpy_v2()` is relatively fast, but slower than my `vert_dist_matrix()` with the vectorized inner loop. See the update in the original post. – Harvie.CZ Aug 27 '18 at 00:21