
I am comparing two point clouds of different sizes. I don't want to cut off the last points in the larger point cloud pc1. For each point in pc1 I would like to find the nearest neighbor in pc2, and once a point from pc1 or pc2 has been used in a pair, it must not be used again in any other comparison:

1. Calculate the distances from pc1 to pc2.
2. Start from the lowest distance: take the two points in pc1 and pc2 and flag them as used (or delete them).
3. Get the next pair with the next-higher distance, and so on, until all points in pc2 are used.
4. Return the list of distances.
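A naive sketch of that procedure, using a full distance matrix from scipy (the `greedy_match` name and structure are mine, just to pin down the desired behavior):

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_match(pc1, pc2):
    """Pair points greedily: globally smallest distance first,
    each point used at most once; returns the sorted list of distances."""
    d = cdist(pc2, pc1)                       # full distance matrix (rows: pc2, cols: pc1)
    order = np.argsort(d, axis=None)          # flat indices, smallest distance first
    used1 = np.zeros(len(pc1), dtype=bool)
    used2 = np.zeros(len(pc2), dtype=bool)
    dists = []
    for flat in order:
        i, j = np.unravel_index(flat, d.shape)
        if used2[i] or used1[j]:
            continue                          # one of the two points is already paired
        used1[j] = used2[i] = True
        dists.append(d[i, j])
        if used2.all():                       # the smaller cloud is fully matched
            break
    return dists

pc1 = np.array([[-1, -1], [-2, -1], [1, 1], [2, 2], [3, 5]])
pc2 = np.array([[0, 0], [0, 0], [6, 6]])
dists = greedy_match(pc1, pc2)
```

For two 8000-point clouds the matrix has roughly 64 million entries, so this pins down the behavior I want rather than being fast enough for 200 frames.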

Here is what I tried so far:

from sklearn.neighbors import NearestNeighbors
import numpy as np

pc1 = np.array([[-1, -1], [-2, -1], [1, 1], [2, 2], [3, 5]])
pc2 = np.array([[0, 0], [0, 0], [6,6]])

nbrs = NearestNeighbors(n_neighbors=1, algorithm='kd_tree', metric='euclidean').fit(pc2)
distances, indices = nbrs.kneighbors(pc1)

This is the output:

indices = [[0],[0],[0],[0],[2]]

But I want to have:

indices_wanted = [[0],[1],[2]]

which should refer to the points in pc1: [[-1, -1], [1, 1], [3, 5]]

Is there an efficient way of doing that? My point clouds are in 3D, with approximately 8000 points per cloud. It needs to be very fast, since I have to repeat this procedure for every frame of a "movie" with 200 frames. Here is some sample data in 3D:

    pc1 = np.random.randn(8000,3)
    pc2 = np.random.randn(7990,3)

Here is an image to clarify the situation: Points in Cube

The red points are pc1 and the green ones are pc2. Ultimately, I have 3D points.


Edit:

I am not restricted to sklearn; I just know it is very efficient. So KDTree is also an option. I may have to include a maximum distance between two nodes to speed up this approach. The cube I am using is 4 m in size (I already delete points that are too far off before running the neighbor search).


Problem: The code provided by @user2874583 takes around 0.8 s for one set with 8000 points. That is way too slow; I need less than 0.2 s. Is there a way to modify the code to make use of the structure (a cube with no points outside)? Maybe by sorting the array?

Peer Breier
    1) are you committed to sklearn, or can you use `scipy`'s `KDTree` natively? 2) is there a maximum distance that you don't want to go above for two nodes? – Daniel F Sep 17 '19 at 09:15
    Why should `[1]` correspond to `[-2,-1]` when `[1,1]` is closer? – Daniel F Sep 17 '19 at 09:34
    Why is indices_wanted = [[0],[1],[2]] corresponding with [[-1, -1], [-2, -1], [3, 5]]? – user2874583 Sep 17 '19 at 09:38
    how big is the second point cloud? – Daniel F Sep 17 '19 at 09:51
  • 1) Sorry, just changed it to [1,1]. 2) Indices are not needed. But for debugging it would be nice to have them. 3) The second pointcloud is slightly smaller than the first one. It is approximately 10 points less, so 7990. (but the precise number is changing each frame) – Peer Breier Sep 17 '19 at 11:01
  • From algorithmic point of view, the most efficient algorithm for this type of problem I think is [octree](https://en.wikipedia.org/wiki/Octree). – Shihab Shahriar Khan Sep 17 '19 at 18:06

1 Answer


You can use the scipy library to calculate the distance between two points.

from scipy.spatial.distance import cdist
import numpy as np

def closest_node_index(node, nodes):
    # index of the point in `nodes` nearest to `node`
    index = cdist([node], nodes).argmin()
    return index

final = []
for arr in pc2:
    i = closest_node_index(arr, pc1)
    final.append(pc1[i])
    pc1 = np.delete(pc1, i, axis=0)  # drop the matched point so it is not reused
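For reference, here is a self-contained run of this loop on the 2D sample from the question, also recording the distances. Note that it pairs points in the order they appear in pc2 rather than strictly smallest-distance-first, which happens to give the same result on this sample:

```python
import numpy as np
from scipy.spatial.distance import cdist

pc1 = np.array([[-1, -1], [-2, -1], [1, 1], [2, 2], [3, 5]])
pc2 = np.array([[0, 0], [0, 0], [6, 6]])

final = []
distances = []
for arr in pc2:
    d = cdist([arr], pc1)              # distances from this pc2 point to all remaining pc1 points
    i = d.argmin()
    final.append(pc1[i])
    distances.append(d[0, i])
    pc1 = np.delete(pc1, i, axis=0)    # remove the matched point so it cannot be reused
```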
user2874583
    Wow, great answer. Thanks. This is exactly what I need, BUT: One run takes ~0.8s on my machine, without loading the points. The total procedure for my 200 frames took: 196.88s. Unfortunately that is way too much. I can spend like max~20s in total. Any idea for a speed up? – Peer Breier Sep 17 '19 at 11:57
  • I just added some sample data to create 8000 points. Maybe you know an implementation that does the same thing but faster? I output the distances like: distance.append(scipy.spatial.distance.euclidean([arr], pos_ref[i])) - Thanks again for your great work, it has helped me a lot! – Peer Breier Sep 17 '19 at 15:32
    @PeterBreier Unfortunately you can't really leverage `KDTree` unless you have some upper threshold of distance you don't want to connect with. Without KDTree you're stuck with an exhaustive search (as above), which takes time. You could KDtree some subset and then exhaustive search the rest, but your timings will not be low consistently - sometimes they'll take even longer than exhaustive. Your problem is you want to iteratively remove members from one set, and a KDTree is not very mutable. For distance formulations you can have mutable or fast, but usually not both. – Daniel F Sep 18 '19 at 06:21
    you could however speed up this code by using `index = cdist([node], nodes, 'sqeuclidean').argmin()`. The minimum squared euclidean distance will correspond to the minimum euclidean distance, but you won't need to calculate the time-costly square root. – Daniel F Sep 18 '19 at 06:36
  • That is already very helpful! I think I have to go for cKDTree, and take some large distance. But then again there is the problem with deleting the used points... – Peer Breier Sep 18 '19 at 09:21
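Following up on the squared-distance suggestion above, the helper could be rewritten like this (a sketch; the argmin is unchanged because squaring distances preserves their order):

```python
import numpy as np
from scipy.spatial.distance import cdist

def closest_node_index(node, nodes):
    # 'sqeuclidean' skips the square root; the nearest index is the same
    return cdist([node], nodes, 'sqeuclidean').argmin()

nodes = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
i = closest_node_index([0.9, 1.1], nodes)   # nearest point is [1, 1]
```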