1

I have two lists/flat arrays a and b of 2D coordinates (x,y). (a contains up to ~1000 coordinates, b up to ~200,000.) I need to filter b to obtain a list of only those coordinates in b which are within a certain Euclidean distance, say 10, of any coordinate in a. For example:

import numpy as np

np.random.seed(123)

a = np.random.randint(50, size=(5, 2))
b = np.random.randint(50, size=(1000, 2))

def dist(coord_a, coord_b):
    squ_dist_x = (coord_a[0] - coord_b[0])**2
    squ_dist_y = (coord_a[1] - coord_b[1])**2
    return np.sqrt(squ_dist_x + squ_dist_y)

b_filter = []
for coord_b in b:
    for coord_a in a:
        distance = dist(coord_a, coord_b)
        if distance <= 5:
            b_filter.append(coord_b)

Output for b_filter:

[array([21, 30]),
 array([ 3, 30]),
 array([ 4, 30]),
 array([48, 32]),
 array([ 8, 32]),
 array([21, 32]),
 array([16, 30]),
 array([16, 30]),
 array([ 0, 32]),
 array([45, 32]),
 array([32, 30]),
 array([ 3, 31]),
 array([21, 30]),
 array([20, 31]),
 array([ 6, 32]),
 array([24, 31]),
 array([31, 31]),
 array([ 2, 31]),
 array([32, 32]),
 array([22, 32]),
 array([46, 30]),
 array([40, 32]),
 array([20, 31]),
 array([ 4, 32]),
 array([31, 31]),
 array([35, 30]),
 array([37, 32]),
 array([ 1, 32]),
 array([31, 31]),
 array([44, 30]),
 array([24, 32]),
 array([40, 32]),
 array([31, 30]),
 array([ 2, 31]),
 array([47, 32]),
 array([42, 32]),
 array([36, 30]),
 array([ 8, 31]),
 array([19, 30]),
 array([42, 31]),
 array([48, 31]),
 array([ 4, 30]),
 array([47, 32]),
 array([30, 32]),
 array([47, 31]),
 array([29, 31]),
 array([29, 31]),
 array([17, 32]),
 array([24, 32]),
 array([46, 30]),
 array([ 2, 30]),
 array([ 4, 30]),
 array([27, 30]),
 array([ 1, 31]),
 array([30, 32]),
 array([ 5, 32]),
 array([25, 32]),
 array([35, 30]),
 array([23, 31]),
 array([36, 31]),
 array([41, 30]),
 array([17, 30]),
 array([26, 30]),
 array([19, 30]),
 array([22, 31]),
 array([ 2, 31]),
 array([23, 32]),
 array([30, 30]),
 array([ 3, 31]),
 array([40, 30])]

except the arrays I need to do this with are way larger. (But only around 5% will be under the distance limit, as in the example) How can I do this most efficiently?

  • You may be interested in a *quadtree*. – user253751 Apr 15 '21 at 11:54
  • 2
    You can avoid lots of calculations by only calculating / filtering on the distance squared (drop the root calculation, and square the distance input of the filter) You can also try to keep the dsitance precalculated int the array, that would slightly slowdown insertion,s but save lots of time on any subsequent filtering. – Pac0 Apr 15 '21 at 11:54
  • also, the data generation is something I would do separately from the testing / benchmarking of the distance filter. It's probably significant here. – Pac0 Apr 15 '21 at 11:56
  • 1
    A little gain: once you have found a near `coord_a` points, you don't need to consider the other `coord_a` points. – Damien Apr 15 '21 at 11:56
  • Another simple way to speed up: sort both arrays according to `x` values. Then, for a given `coord_b.x`value, you only need to consider the `coord_a.x` in the range `(coord_b.x - distance, coord_b.x + distance)` – Damien Apr 15 '21 at 12:03
  • @Damien You'd probably only need to sort the `a` array. – Ian Abbott Apr 15 '21 at 12:10
  • @IanAbbott Sorting `a` is most important effectively. What I had in mind is that sorting `b` array may accelerate finding the `a` point at distance `coord_b.x - distance`, as it would be very near the point found for the previous `b` point. – Damien Apr 15 '21 at 12:14
  • 1
    @Damien Perhaps. There may also be some mileage in using `y` as a secondary sort key. – Ian Abbott Apr 15 '21 at 12:18
  • You can try cpu to memory trade off: create x-y grid with acceptable resolution, with zeros, mark cells around points in `a` (in 5 distance) as `1` and simply check value in appripriate cell for any point in `b`. – Stanislav Ivanov Apr 15 '21 at 12:24
  • I think there is also an error in your code: in your double loop, you possibly add `b_filter.append(coord_b)` repeatedly (once for each `a` point that matches by distance). – Pierre D Apr 15 '21 at 12:38
  • I have created a [new post](https://stackoverflow.com/questions/67109987/list-all-points-within-distance-to-line) which is more focused on my exact problem. Decided not to delete this one since it's accrued 4 answers, which may be useful to someone – Vainmonde De Courtenay Apr 15 '21 at 14:10

4 Answers4

3

One of the best ways to do that is by using the excellent scipy.spatial.KDTree. It does the whole thing for a: 1000 points and b: 200K points in 114 ms. That includes: building the kd-tree, querying it, and filtering b for distances within the radius.

(Important: use scipy >= 1.6.0, where the Python implementation of KDTree has been replaced by cKDTree. See the release notes).

# please use scipy >= 1.6.0
from scipy.spatial import KDTree

tree = KDTree(a)
radius = 10
dist, idx = tree.query(b, k=1, distance_upper_bound=radius + 1)

b_filtered = b[dist <= radius]

Explanation

  • We use tree.query() with distance_upper_bound=radius + 1 so as to include points that are exactly at the given radius (the behavior of .query() is to return points that are strictly less than the upper bound).
  • We then filter for distances less or equal to radius.
  • Of course, if you are interested about points where dist < radius, then no need to add 1, but it doesn't hurt since you have to filter anyway in the next step (points for which there is no close neighbor get +inf as distance).

Timing

a = np.random.randint(50, size=(1000, 2))
b = np.random.randint(50, size=(200_000, 2))

radius = 10
%timeit b_filtered = b[KDTree(a).query(b, k=1, distance_upper_bound=radius + 1)[0] <= radius]
# 114 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pierre D
  • 24,012
  • 7
  • 60
  • 96
0

A simple way to improve your code is to avoid the sqrt here:

return np.sqrt(squ_dist_x + squ_dist_y)

Just return the sum and then compare against 25.

BTW:

This seems wrong

    squ_dist_x = (coord_a[0] - coord_b[1])**2
                                       ^
                                      [0] ??
Support Ukraine
  • 42,271
  • 4
  • 38
  • 63
0

Using broadcasting you could create an index that indicate which coordinate are closer than n unit:

# Minimal distance:
n = 5
# Creation of the index:
ind = (((b[None,:]-a[:,None])**2).sum(2)**0.5<=n).sum(0) > 0
# Get the result:
b_filter = b[ind]

Where b[None,:]-a[:,None] return a (5,1000,2) array containing the coordinate difference. Then we simply apply the euclidian distance formula and we check if at least one value of a is closer from b than n.

obchardon
  • 10,614
  • 1
  • 17
  • 33
0

If, for some reason, you want to do it in pure numpy, here is a vectorized (but naive) expression:

b_filtered = b[np.any(np.square(b[:, None] - a).sum(axis=2) <= radius**2, axis=1)]

But this is O(m * n) for m, n points in a, b, respectively. For m, n = 1000, 200_000, it takes over 6 seconds.

The other answer I provided that uses scipy.spatial.KDTree is much faster (it cuts down a lot of the calculation) and does it in 114ms.

Pierre D
  • 24,012
  • 7
  • 60
  • 96