
I have a list of coordinates `a`. I want to specify a radius `r` and then list all points on a 2D grid within that radius of any point in `a`, together with the minimum distance of each of those grid points to any point in `a`. Since `a` is substantial (~1000-2000 points) I want to be as efficient as possible.

My implementation so far uses the code below to list all coordinates within the given radius of a single point, iterates that over all coordinates in `a`, flattens the output, and takes a set (there will be MANY duplicates). That set is the collection of coordinates within the specified radius of any point on the line; I then compute the minimum distance of each of those points to `a` with `scipy.spatial.distance.cdist`:

import numpy as np
np.random.seed(123)
a = np.random.randint(50, size=(100, 2))

def collect(coord, sigma: float = 3.0):
    """Create a small collection of grid points in a neighborhood of some point."""
    neighborhood = []

    x = coord[0]
    y = coord[1]

    X = int(sigma)
    for i in range(-X, X + 1):
        Y = int(pow(sigma * sigma - i * i, 1 / 2))
        for j in range(-Y, Y + 1):
            neighborhood.append((x + i, y + j))

    return neighborhood

rad = 10
coord_list = [collect(coord, rad) for coord in a]
coord_set = np.array(list(set([val for sublist in coord_list for val in sublist])))

from scipy.spatial import distance

dists = distance.cdist(coord_set, a).min(axis=1)

Result:

coord_set
> array([[26, 21],
       [18, 17],
       [50,  6],
       ...,
       [14, 47],
       [15, 12],
       [ 7,  8]])

dists
> array([2.23606798, 5.65685425, 1.41421356, ..., 3.16227766, 3.        ,
       1.41421356])

Does anyone have any ways I can improve this and do it more efficiently?

  • Out of curiosity: What describes this line? Is it a function you have? Or some sort of definition of the line in `a`? – LeoE Apr 15 '21 at 14:09
  • @LeoE. Looks like a collection of random numbers from the code. Not really a line – Mad Physicist Apr 15 '21 at 14:11
  • @MadPhysicist yes, but when OP talks about "I have a line (not a straight-line), coordinates contained in a." I assumed, that besides the random set of coordinates there also exists a line? Or did I misunderstand something? – LeoE Apr 15 '21 at 14:13
  • Could you generate a plot of what you are looking for? `a` appears to be a collection of random points. Your prose is not very clear about what the grid is and what you are trying to do exactly. – Mad Physicist Apr 15 '21 at 14:14
  • @LeoE. I am pretty sure that "line" should read "collection of points". I don't think there's a deeper meaning to it. – Mad Physicist Apr 15 '21 at 14:14
  • Let's consider it a collection of points (coordinates in `a`) for now. If my code cannot be improved under such circumstances we can maybe consider the form of the line - it is the output of OpenCV `cv2.findContours` - but the more general problem seems worthwhile focusing on if possible – Vainmonde De Courtenay Apr 15 '21 at 14:17
  • Yes, all of them. In other words, just assume that the function that produces the line generates a list of coordinates of points in the line but is otherwise a black-box function (we cannot guarantee that a given point is vertically, horizontally, or diagonally linked to the last). The line is certainly not some simple mathematical function. The coordinates in the line - and _only_ the coordinates in the line - are listed in `a`. – Vainmonde De Courtenay Apr 15 '21 at 14:26
  • Do we need to 'draw' the theoretical line or are you essentially alright with finding points with minimum distance to the vertices of the line? Also could you include your full code including point generation so that its possible to time solutions. Although it looks like you are using a brute force method. – Jason Chia Apr 15 '21 at 14:27
  • @JasonChia no need to plot anything. Assume the line is just a random collection of points (since I can't find any rule actually linking one coordinate to the next). The full code is included (I generate the "line" i.e. list of coordinates as `a`). – Vainmonde De Courtenay Apr 15 '21 at 14:32
  • @JasonChia if you want a longer problem (like I'll actually have to deal with in practice) to time solutions, just alter `a = np.random.randint(50, size=(100, 2))` to say `a = np.random.randint(1000, size=(2000, 2))`. You should expect my code to take well over 30 seconds, hence the desire for optimization – Vainmonde De Courtenay Apr 15 '21 at 14:37
  • Could you elaborate on the grid please? I figured a plot might be helpful to clarify that. – Mad Physicist Apr 15 '21 at 15:30
  • @MadPhysicist check out the code I linked for my method of collection, the OP of that question has a grid. But it is simply a matter of e.g., for coordinate (6,2), the coordinates within 2 units of that should be listed (i.e., ([(4, 2), (5, 1), (5, 2), (5, 3), (6, 0), (6, 1), (6, 2), (6, 3), (6, 4), (7, 1), (7, 2), (7, 3), (8, 2)]). Then we want that for any point in `a` (or any point in the line - starting to wonder whether that's somehow a necessary simplification to get any improvement, but as I said I can find no mathematical link from one coordinate to the next). – Vainmonde De Courtenay Apr 15 '21 at 15:33
  • OK. So not really a specific grid, just all integer coordinates. Gotcha – Mad Physicist Apr 15 '21 at 15:42
  • Final question: `neighborhood.append((x + i, y + j))` is throwing me off here. This won't actually be integer grid points if your `x` and `y` aren't integers. Did you mean to run the loop over `range(int(x - sigma), int(x + sigma) + 1)`? – Mad Physicist Apr 15 '21 at 16:07
  • @MadPhysicist `x` and `y` are guaranteed to be integers (they are the first and second elements of coordinates I generated from `np.random.randint` for my example). In general the entire code relating to the `collect()` function is from the post I linked (though I welcome any improvements to that part too!) – Vainmonde De Courtenay Apr 15 '21 at 16:08

2 Answers


You can adapt the answer I gave to your other linked question in a very straightforward way. The outcome is also very fast (~425 ms for 10K points in a).

Edit: For sparse cases (where the number of points actually kept is a small portion of the complete grid), I have also added a sparse_grid_within_radius() version below.

(Important: use scipy >= 1.6.0, where `scipy.spatial.KDTree` was re-implemented on top of the compiled `cKDTree`. See the release notes).
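One way to fail fast on an older installation is a small version guard (a minimal sketch; the naive parsing assumes a plain `X.Y.Z` version string):

```python
import scipy

# Parse only the major/minor components; any suffix on the patch part
# is ignored by taking just the first two dotted fields.
major, minor = (int(x) for x in scipy.__version__.split(".")[:2])
if (major, minor) < (1, 6):
    raise RuntimeError(f"scipy >= 1.6.0 required, found {scipy.__version__}")
```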

# please use scipy >= 1.6.0
from scipy.spatial import KDTree


def grid_within_radius(a, radius=10, resolution=(1,1), origin=(0,0)):
    pmin = (a - origin).min(axis=0) - radius
    pmax = (a - origin).max(axis=0) + radius
    imin = np.floor(pmin / resolution)
    imax = np.ceil(pmax / resolution) + 1
    xv, yv = np.meshgrid(np.arange(imin[0], imax[0]), np.arange(imin[1], imax[1]))
    grid = np.stack((xv, yv), axis=-1) * resolution + origin
    dist, _ = KDTree(a).query(grid, k=1, distance_upper_bound=radius + 0.01)
    idx = dist <= radius
    return grid[idx], dist[idx]

Usage

First, the OP's example:

np.random.seed(123)
a = np.random.randint(50, size=(100, 2))
g, d = grid_within_radius(a)

To compare with the OP's result, we need to sort their solution (coord_set, dists):

def sort2d(a, other=None):
    other = a if other is None else other
    return other[np.lexsort((a[:, 0], a[:, 1]))]
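As a side note on sort2d (an illustrative check, not part of the solution itself): np.lexsort treats its last key as the primary one, so the rows end up ordered by y, with ties broken by x:

```python
import numpy as np

pts = np.array([[3, 1], [0, 2], [1, 1]])
# The LAST key passed to lexsort (column 1, i.e. y) is the primary sort key.
order = np.lexsort((pts[:, 0], pts[:, 1]))
sorted_pts = pts[order]
# Rows ordered by y, ties broken by x: [[1, 1], [3, 1], [0, 2]]
```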

With this, we can check that our solution is the same:

>>> np.allclose(g, sort2d(coord_set))
True

>>> np.allclose(d, sort2d(coord_set, dists))
True

And another example (with a different grid resolution and radius):

import matplotlib.pyplot as plt

g, d = grid_within_radius(a, radius=0.6, resolution=(.11, .47))
plt.scatter(*a.T, s=10, c='r')
plt.scatter(*g.T, s=1)

Speed

a = np.random.randint(1000, size=(10_000, 2))
%timeit grid_within_radius(a)
# 425 ms ± 528 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sparse version

The above works well when the number of points returned is a significant portion of the total grid (e.g. 30% or more). But for very sparse cases (e.g. when the combination of the bounding box of a, the radius and the grid resolution generates a huge grid only to then eliminate most of it), it is slow. For instance, a = np.random.randint(0, 200, (10, 2)) with the defaults is such a sparse case (plot omitted).

The version below fixes that by generating "patches" of grid around quantized locations instead of the full grid.

import numpy as np
import pandas as pd
from scipy.spatial import KDTree


def unique_rows(a):
    # np.unique(a, axis=0) is slow, in part because it sorts;
    # using pandas.drop_duplicates() instead.
    # see https://github.com/numpy/numpy/issues/11136#issuecomment-822000680
    return pd.DataFrame(a).drop_duplicates().values

def sparse_grid_within_radius(a, radius=10, resolution=(1,1), origin=(0,0), rbox=1):
    resolution = np.array(resolution)
    box_size = radius * rbox
    nxy0 = np.floor(radius / resolution).astype(int)
    nxy1 = np.ceil((box_size + radius) / resolution).astype(int) + 1
    patch = np.stack(list(map(np.ravel, np.indices(nxy0 + nxy1))), axis=1) - nxy0
    
    ar = np.floor((a - origin) / box_size) * box_size
    ia = unique_rows(np.floor(ar / resolution).astype(int))
    grid = unique_rows((ia[:, None] + patch).reshape(-1, 2)) * resolution + origin

    dist, _ = KDTree(a).query(grid, k=1, distance_upper_bound=radius * 1.01)
    idx = dist <= radius
    return grid[idx], dist[idx]

With this, even very sparse results are fast.

Example

a = np.random.randint(0, 4000, (100, 2))

%timeit g, d = grid_within_radius(a)
# 3.88 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit gs, ds = sparse_grid_within_radius(a)
# 29.6 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Speed comparison

import perfplot

perfplot.show(
    setup=lambda n: np.random.randint(0, n, (100, 2)),
    kernels=[grid_within_radius, sparse_grid_within_radius],
    n_range=[2 ** k for k in range(14)],
    equality_check=lambda a, b: np.allclose(sort2d(a[0]), sort2d(b[0])),
)
Pierre D
  • interesting but for me extremely slow: tried on `b = np.random.randint(800, size=(1500, 2)); %timeit grid_within_radius(b)` output: `2min 53s ± 11.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)` Meanwhile, on the same kernel the method in OP is taking 5-6 seconds for that `b` – Vainmonde De Courtenay Apr 19 '21 at 10:13
  • That said, it does seem like an algorithm developed for kNN would have a good chance to solve this problem – Vainmonde De Courtenay Apr 19 '21 at 10:36
  • something isn't right. It takes 210ms on my machine. – Pierre D Apr 19 '21 at 11:33
  • the case where I found it to be slow is if the potential grid is enormous and the result is sparse, e.g.: `a = np.random.randint(10000, size=(1000, 2))`. For this, I have a solution (using a first rough grid to find local areas of the space worth pursuing, then generate the fine mesh there). – Pierre D Apr 19 '21 at 11:40
  • but first we need to find why it is slow for you in the example above (`randint(800, size=(1500, 2))`). I ran this in a brand new kernel, copying back from the code I posted, and I see 210ms. What versions of `numpy`, `scipy` and `python` are you running? – Pierre D Apr 19 '21 at 11:49
  • `numpy==1.19.0`, `scipy==1.4.1`, `Python 3.8.3`. – Vainmonde De Courtenay Apr 19 '21 at 11:54
  • you may want to update `scipy` (the latest is `1.6.2`, I use `1.6.1`). It seems that some important improvements [have been done on `scipy.spatial.KDTree` in `1.6.0`](https://docs.scipy.org/doc/scipy/reference/release.1.6.0.html#scipy-spatial-improvements). – Pierre D Apr 19 '21 at 12:05
  • Thanks! Looks like it now works super-fast, and thanks for explaining why. – Vainmonde De Courtenay Apr 19 '21 at 13:53
  • cool; there is still a speed issue for fine-grain grids and sparse space. I'll think more about it when I have time. I have one option that generates "patches" around quantized `a`, then use `np.unique` on those and a final `KDTree.query()`. But `np.unique()` is known to be slow(er than it should be). See [this Numpy issue](https://github.com/numpy/numpy/issues/11136#issuecomment-822000680). The other option is to use our own `grid_within_radius` a first time to establish a coarse grid of interesting locations, then generate patches at these locations and call again. – Pierre D Apr 19 '21 at 14:14
  • what about sparsity for non-massive grids (e.g., `np.random.randint(1500, size=(300, 2))`), would you expect this to be problematic? For me this does take a few seconds, although I wouldn't expect it to really cause issues. Also, is it a function of the total grid coordinates or of the sparsity - e.g., if somehow my `a` were all localized within one 1000x1000 square of a 10,000x10,000 grid, would it run quickly? – Vainmonde De Courtenay Apr 19 '21 at 14:29
  • yeah, on `np.random.randint(1500, size=(300, 2))`, the standard approach above takes 600ms on my machine; the `sparse_grid_within_radius` takes 89ms. I'll add it, but it uses a trick version of `np.unique()`, this is less focused on your question so I'm reluctant to add all that complexity. – Pierre D Apr 19 '21 at 14:38
  • ok, I added a sparse version; check it out. Note: I'm sure there are many other ways to do this, but this is an okay compromise (a bit slower for density above 30%, but much faster for sparse cases). – Pierre D Apr 19 '21 at 15:51
  • thanks. What's on the x-axis of your speed comparison? Is it simply `n`? and approximately where is the speed crossover (n=250? n=400?) – Vainmonde De Courtenay Apr 19 '21 at 18:10
  • Yes it's `n`. The crossover in this case is around `n=250`, but it really depends on the density (ratio of the size of the filtered grid, aka the result, to the full grid). That ratio where the two methods cross over is around 30% in my tests. That said, I would not bother: just always use the `sparse` version, as the time difference when the non-sparse wins is usually small. – Pierre D Apr 19 '21 at 18:33

Let's inspect your implementation carefully. Notice that you already partially compute a distance metric in collect.

  1. What if you made `neighborhood` a dict instead of a list, keyed by grid point, with the minimum squared distance as the value? You could entirely eliminate the calls to `set` and `cdist`.
  2. If `a` can contain float values, you should run your range from `int(np.ceil(coord[0] - rad))` to `int(coord[0] + rad) + 1`: `int(0.5 - 10)` is `-9`, while `int(0.5) - 10` is `-10`.
  3. You can compare to the squared radius, so you don't need to take a square root except once, for the final result.

Points 2 and 3 are relatively minor improvements.
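Point 2 is about where the truncation happens: int() truncates toward zero, so int(x - rad) and int(x) - rad can differ for non-integer x. A quick illustration:

```python
import math

# int() truncates toward zero, so for negative values it is not a floor,
# and it does not commute with subtraction.
assert int(0.5 - 10) == -9          # int(-9.5) truncates toward zero -> -9
assert int(0.5) - 10 == -10         # truncate first, then shift -> -10
assert math.floor(0.5 - 10) == -10  # floor rounds toward -infinity -> -10
```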

Here is an example:

rad = 10
rad2 = rad**2

points = {}

for coord in a:
    for x in range(int(np.ceil(coord[0] - rad)), int(coord[0] + rad) + 1):
        dx2 = (x - coord[0])**2
        chord = np.sqrt(rad2 - dx2)
        for y in range(int(np.ceil(coord[1] - chord)), int(coord[1] + chord) + 1):
            d2 = dx2 + (y - coord[1])**2
            key = x, y
            points[key] = min(d2, points.get(key, rad2))

To convert this into numpy arrays:

grids = np.array(list(points.keys()))
nearest = np.sqrt(np.fromiter(points.values(), count=len(points), dtype=float))
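To verify that the dict-based distances agree with the cdist reference, here is a self-contained re-run on the question's example (a sketch that simply pastes the pieces above together):

```python
import numpy as np
from scipy.spatial import distance

np.random.seed(123)
a = np.random.randint(50, size=(100, 2))

rad = 10
rad2 = rad ** 2
points = {}

# Same loop as above: for each source point, visit every grid point within
# the radius and keep the minimum squared distance seen so far.
for coord in a:
    for x in range(int(np.ceil(coord[0] - rad)), int(coord[0] + rad) + 1):
        dx2 = (x - coord[0]) ** 2
        chord = np.sqrt(rad2 - dx2)
        for y in range(int(np.ceil(coord[1] - chord)), int(coord[1] + chord) + 1):
            d2 = dx2 + (y - coord[1]) ** 2
            key = x, y
            points[key] = min(d2, points.get(key, rad2))

grids = np.array(list(points.keys()))
nearest = np.sqrt(np.fromiter(points.values(), count=len(points), dtype=float))

# Brute-force reference: minimum distance from every kept grid point to a.
ref = distance.cdist(grids, a).min(axis=1)
assert np.allclose(nearest, ref)
```
Every grid point in the dict is within rad of at least one source point, so the minimum over the covering points equals the global minimum that cdist computes.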
Mad Physicist
  • certainly much faster. Since you went as far as to provide the `nearest` indices, it would be nice if you could finish off by computing `dists` (i.e. the minimum distance of each point in your `grids`, to `a`). What I'm a little hesitant about is, why is there a difference between your `grids` and my implementation's `coord_set`? (check for the example the lengths: the seemingly innocuous 4281 vs 4285 difference becomes huge when we use much larger arrays) – Vainmonde De Courtenay Apr 15 '21 at 17:03
  • @VainmondeDeCourtenay. I got carried away and added a couple of extra items. Fixing momentarily. BTW, your solution is about 2x faster than mine... – Mad Physicist Apr 15 '21 at 17:04
  • No, yours appears to scale much better with large arrays which is what I really need. As I mentioned above we are interested in `a = np.random.randint(1000, size=(2000, 2))` and the like. On arrays of this size your solution appears to be ~40x faster – Vainmonde De Courtenay Apr 15 '21 at 17:05
  • @VainmondeDeCourtenay. I didn't check that. Thanks for noticing. For 100 elements, it's 30ms vs 60ms on my machine. For 10k elements, its 5s vs 10s, so not any better. – Mad Physicist Apr 15 '21 at 17:15
  • Ah I see. Did you include computing `dists` in your benchmarking (I didn't)? Anyway perhaps not _unequivocally_ better then- still, please drop a comment if you find out how to eliminate the extra results in `grids` and/or if you compute `dists` at the end. – Vainmonde De Courtenay Apr 15 '21 at 17:27
  • @VainmondeDeCourtenay. I've updated the code. Got too smart with trimming the indices for floating point line and radius. I'm timing the distances for both, since it's integrated directly into my computation. For some reason, timeit is behaving strangely. Your code definitely takes longer to run on very large arrays, but timeit reports a shorter time. – Mad Physicist Apr 15 '21 at 18:02
  • @VainmondeDeCourtenay. I'm trying to see if there's something you can do with sparse matrices here instead of a dictionary. Will get back to you. – Mad Physicist Apr 15 '21 at 18:11
  • for kicks, you may want to check the solution I provide. On my machine, it is 16x faster than this on 10K points. It is simply using `meshgrid` and `scipy.spatial.KDTree`. – Pierre D Apr 17 '21 at 18:10