I have a 2D array and I want to find for each (x, y) point the distance to its nearest neighbor as fast as possible.

I can do this using scipy.spatial.distance.cdist:

import numpy as np
from scipy.spatial.distance import cdist

# Random data
data = np.random.uniform(0., 1., (1000, 2))
# Distance between the array and itself
dists = cdist(data, data)
# Sort each row in ascending order (in place)
dists.sort()
# Select column 1, since column 0 is always 0
# (the distance of a point to itself)
nn_dist = dists[:, 1]

This works, but I feel like it's too much work, and KDTree should be able to handle this; I'm just not sure how. I'm not interested in the coordinates of the nearest neighbor, only in the distance (and in computing it as fast as possible).

Gabriel

1 Answer

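KDTree can do this, and the process is almost the same as with cdist: query the two closest points for each point, since the first one is the point itself. In the benchmark below, however, cdist is much faster than KDTree, and, as pointed out in the comments, cKDTree is faster still: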

import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import KDTree
from scipy.spatial import cKDTree
import timeit

# Random data
data = np.random.uniform(0., 1., (1000, 2))

def scipy_method():
    # Distance between the array and itself
    dists = cdist(data, data)
    # Sort each row in ascending order (in place)
    dists.sort()
    # Select column 1, since column 0 is always 0
    # (the distance of a point to itself)
    nn_dist = dists[:, 1]
    return nn_dist

def KDTree_method():
    # You have to create the tree to use this method.
    tree = KDTree(data)
    # Then query the two closest points, since the first is the point itself
    dists = tree.query(data, 2)
    nn_dist = dists[0][:, 1]
    return nn_dist

def cKDTree_method():
    tree = cKDTree(data)
    dists = tree.query(data, 2)
    nn_dist = dists[0][:, 1]
    return nn_dist

print(timeit.timeit('cKDTree_method()', number=100, globals=globals()))
print(timeit.timeit('scipy_method()', number=100, globals=globals()))
print(timeit.timeit('KDTree_method()', number=100, globals=globals()))

Output (total seconds for 100 calls each, in the order cKDTree, cdist, KDTree):

0.34952507635557595
7.904083715193579
20.765962179145546

Once again, the very unneeded proof that C is awesome!
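As a sanity check, here is a minimal sketch that wraps the cKDTree approach in a function and compares it against a brute-force result from cdist. The helper names `nn_distances` and `nn_distances_brute` and the fixed seed are only illustrative, not part of SciPy. Also note that, as far as I know, from SciPy 1.6 onward KDTree is backed by the same implementation as cKDTree, so the large gap measured above may not reproduce on current versions.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist


def nn_distances(points):
    """Distance from each point to its nearest other point (tree-based)."""
    tree = cKDTree(points)
    # k=2: the first hit is the point itself (distance 0),
    # the second is its nearest neighbor.
    dists, _ = tree.query(points, k=2)
    return dists[:, 1]


def nn_distances_brute(points):
    """Brute-force reference: full distance matrix, ignoring the diagonal."""
    d = cdist(points, points)
    # Mask each point's zero distance to itself before taking the row minimum.
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


# Illustrative data with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, (1000, 2))

nn_tree = nn_distances(data)
nn_brute = nn_distances_brute(data)
print(np.allclose(nn_tree, nn_brute))
```

Taking the row minimum after masking the diagonal also avoids sorting the whole distance matrix, although the brute-force version still needs memory for all pairwise distances, which the tree query does not.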

Akaisteph7
  • Wow, that's a lot of difference in runtime. I just assumed that `KDTree` would be considerably faster, not sure why. Thanks @Akaisteph7! – Gabriel Jul 21 '19 at 00:59
  • I followed Paul's advice and tried `cKDTree` instead of `KDTree` (just need to change two letters in the code above) and it is *orders* of magnitude faster than `cdist`. – Gabriel Jul 21 '19 at 01:06
  • Can this be used to measure the (x,y) distance to the closest 0 value in a binary file? – Ricardo Guerreiro Jan 21 '22 at 13:34