
List1 contains a large number (~7^10, roughly 282 million) of N-dimensional points (N <= 10); List2 contains the same number of N-dimensional points or fewer (again N <= 10).

My task is this: for every point in List1, I want to find the point in List2 that is closest to it (Euclidean distance) and then perform some operation on it. I had been doing it the simple way, with nested loops, when I had no more than 50 points in List1, but with 7^10 points this obviously takes a lot of time.
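For reference, the nested-loop baseline looks something like this (a minimal sketch assuming the points are stored in NumPy arrays; the names list1 and list2 are illustrative):

```python
import numpy as np

def nearest_brute_force(list1, list2):
    """For each point in list1, return the index of its closest point in list2.

    O(len(list1) * len(list2)) distance computations: hopeless at ~7^10 points.
    """
    result = np.empty(len(list1), dtype=int)
    for i, p in enumerate(list1):
        # squared Euclidean distance from p to every point in list2;
        # the square root is monotone, so it is not needed for argmin
        d2 = ((list2 - p) ** 2).sum(axis=1)
        result[i] = np.argmin(d2)
    return result
```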

What is the fastest way to do this? Are there any concepts from computational geometry that might help?

EDIT: Here is what I have in place now: I have built a kd-tree out of List2, and I am running a nearest-neighbour search in it for each point in List1. Although this saves me the brute-force Euclidean distance computation for every pair, the sheer number of points in List1 (7^10, as I originally pointed out) still consumes a lot of time. Is there any way I can improve this?
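A minimal sketch of this setup using SciPy's cKDTree (one readily available Python kd-tree; the data below is a random stand-in). Querying all of List1 in a single vectorized call already avoids a great deal of Python-level loop overhead:

```python
import numpy as np
from scipy.spatial import cKDTree

list1 = np.random.rand(100_000, 10)  # stand-ins for the real data
list2 = np.random.rand(50_000, 10)

tree = cKDTree(list2)            # build once over List2
# query every point of List1 in one call rather than in a Python loop
dists, idx = tree.query(list1, k=1)
# idx[i] is the index into list2 of the nearest neighbour of list1[i]
```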

VividD
  • Can you profile your code to see where (and if) the bulk of the time is being spent? That is, if 80% of the time is spent doing the split, you can try simplifying the split code. If 80% is spent on the distance comparison, you might try improving the comparison (e.g.: if the distance on one axis is already greater than your current minimum total, don't bother checking the other dimensions), etc. Let's see where to optimize. – mpez0 May 26 '10 at 16:39

4 Answers


Well, a good way would be to use something like a kd-tree and perform nearest-neighbour searches with it. Fortunately you do not have to implement this data structure yourself; it has been done before. I recommend this one, but there are others:

http://www.cs.umd.edu/~mount/ANN/

PeterK
  • Since my existing code is in Python, I am not going to use this library. I am looking to use http://gamera.sourceforge.net/doc/html/kdtree.html. But the most important thing is the concept of using kd-trees here. Thanks! –  May 26 '10 at 07:50
  • PeterK: Sorry for toggling the "Accept". Can you please look at the edited question now? –  May 26 '10 at 16:13
  • I think we are running out of options. There are two possibilities, I think: first, you could use some kind of approximation while searching for the NN; the mentioned ANN library allows this, though I don't know about the others. Second, you could somehow preprocess List1, for example by computing "clusters" of some size. By clusters I mean points which are very similar, so that one of them can be picked as a representative of the cluster. If you then run an NN search for more than one neighbour, you might be able to get good results (but it still depends on what exactly you want to do). See the sketch after these comments. – PeterK May 26 '10 at 21:10
  • Oh, one more thing: you could go the parallel way and do more searches at a time. This would of course only be reasonable if you have a multicore processor and your system allows threading. Moreover, you need the kd-tree to be thread-safe, at least when it comes to searching. This is also sketched below. – PeterK May 26 '10 at 21:11
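Both suggestions, approximate search and parallel queries, can be tried directly with SciPy's cKDTree. The eps and workers arguments below are real cKDTree.query parameters (workers requires a reasonably recent SciPy), and the data is a random stand-in:

```python
import numpy as np
from scipy.spatial import cKDTree

list1 = np.random.rand(100_000, 10)
list2 = np.random.rand(50_000, 10)
tree = cKDTree(list2)

# approximate search: the reported neighbour is within a factor (1 + eps)
# of the true nearest distance, which lets the search prune far more branches
dists, idx = tree.query(list1, k=1, eps=0.1)

# parallel search: spread the queries over all available cores (workers=-1)
dists, idx = tree.query(list1, k=1, eps=0.1, workers=-1)
```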

It's not possible to say which algorithm is most efficient without knowing something about the distribution of points in the two lists. However, for a first guess...

First algorithm (doesn't work, for two reasons: (1) a wrong assumption, namely that the bounding hulls are disjoint, and (2) a misreading of the question: it doesn't find the nearest partner for every point in List1).

...compute the convex hull of the two sets: the closest points must lie on a hyperface of the two hulls through which the line between the two centres of gravity passes.

You can compute the convex hull by computing the centre point (the centre of gravity, assuming all points have equal mass) and ordering each list from furthest from the centre to nearest. Then take the furthest point in the list, add it to the convex hull, and remove all points that lie inside the hull computed so far (you will need to compute lots of 10-d hypertriangles to do this). Repeat until there is nothing left in the list that is not on the convex hull.

Second algorithm: partial

Compute the convex hull of List2. For each point of List1: if the point is outside the convex hull, then find the hyperface as in the first algorithm; the nearest point must be on this face. If the point is on the face, likewise. If it is inside, you can still find the hyperface by extending the line past the List1 point; the nearest point must be inside the ball that contains both that hyperface and List2's centre of gravity. Here, though, you need a different algorithm to get the nearest point, perhaps the kd-tree approach.
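A sketch of the inside/outside test this algorithm needs, using SciPy's ConvexHull (Qhull) instead of the hand-rolled construction above. hull.equations stores one hyperplane [normal | offset] per facet, and a point is inside exactly when it lies on the non-positive side of every facet (random stand-in data; note that Qhull's cost grows steeply with dimension):

```python
import numpy as np
from scipy.spatial import ConvexHull

list1 = np.random.rand(2_000, 10)
list2 = np.random.rand(1_000, 10)

hull = ConvexHull(list2)

# each row of hull.equations is [n_1 ... n_d, b] with n.x + b <= 0 inside
normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
inside = (list1 @ normals.T + offsets <= 1e-12).all(axis=1)

outside_points = list1[~inside]  # candidates for the hyperface shortcut
inside_points = list1[inside]    # fall back to e.g. a kd-tree for these
```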

Performance

When List2 is roughly evenly or normally distributed through some fairly oblique shape, this will do a good job of reducing the number of points under consideration, and it should be compatible with the kd-tree suggestion.

There are some horrible worst cases, though: if List2 contains only points on the surface of a torus whose geometric centre is the centre of gravity of the list, then the convex hull will be very expensive to calculate and will not help much in reducing the number of points under consideration.

My evaluation

These kinds of geometric techniques may be a useful complement to the kd-tree approach of other posters, but you need to know a little about the distribution of points before you can determine whether they are worth applying.

Charles Stewart
  • Even if the problem were to find the closest pair of points overall (which it isn't -- he actually wants the closest partner for *every* point in L1), the closest pair need not be on either hull. Some point near the centre of L1 could be closer to some point near the centre of L2 than any hull-point of L1 is to any hull-point of L2. – j_random_hacker May 26 '10 at 07:03
  • @j_random: I've edited the post to reflect your comment, and describe a revised algorithm. – Charles Stewart May 26 '10 at 08:15
  • Charles: I am sorry that I don't understand much of Computational geometry, hence the convex hull approach may not be something I can quickly use. Thanks a lot for your answer. –  May 26 '10 at 16:14
  • Kudos for the update. But... :) If the volumes of the hulls are disjoint, the nearest point to any point on L1's hull must be somewhere on L2's hull, but that point might be somewhere along a line segment rather than at a corner point defined by an actual point from L2. (Much easier to show this on a diagram I'm afraid.) Also not convinced yet that lines through the centres of mass have the property you ascribe (though it may be a good heuristic) -- you can move the centre of mass anywhere by adding many points near the boundary. (Sorry to be picky... :)) – j_random_hacker May 27 '10 at 02:01
  • @j_random "the nearest point to any point on L1's hull must be somewhere on L2's hull" - Did I really say that! Actually, even what you said need not be true: the nearest point can be just inside the hull. But it must be within the minimum bounding ball of that hyperface. The second property is true, even when L1 & L2 are parallel tori: in this case the ball might be quite large! – Charles Stewart May 27 '10 at 13:44

kd-trees are pretty fast. I've used the algorithm in this paper and it works well: Bentley, "K-d trees for semidynamic point sets".

I'm sure there are libraries around, but it's nice to know what's going on sometimes - Bentley explains it well.

Basically, there are a number of ways to search a tree: nearest N neighbours, all neighbours within a given radius, nearest N neighbours within a radius. Sometimes you want to search for bounded objects.

The idea is that the kd-tree partitions the space recursively. Each node is split in two along an axis-aligned plane in one of the dimensions of the space you are in; ideally it splits perpendicular to the node's longest dimension. You should keep splitting the space until there are about 4 points in each bucket.

Then, for every query point, as you recursively visit nodes, you check the distance from the query point to the partition wall of the node you are in. You descend both children (the one you are in and its sibling) if the distance to the partition wall is smaller than the search radius. If the wall is beyond the radius, just search the child containing the query point.

When you get to a bucket (leaf node), you test the points in there to see if they are within the radius.

If you want the closest point, you can start with a massive radius, and pass a pointer or reference to it as you recurse - and in that way you can shrink the search radius as you find close points - and home in on the closest point pretty fast.
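A minimal Python sketch of the scheme just described: build by splitting on the dimension with the largest spread until each bucket holds about 4 points, then search while shrinking the best distance found so far. This is illustrative code, not Bentley's tuned implementation:

```python
import numpy as np

BUCKET_SIZE = 4

class Node:
    def __init__(self, axis=None, split=None, left=None, right=None,
                 points=None, index=None):
        self.axis, self.split = axis, split
        self.left, self.right = left, right
        self.points, self.index = points, index  # only set on leaf buckets

def build(points, index):
    """Recursively split perpendicular to the longest dimension."""
    if len(points) <= BUCKET_SIZE:
        return Node(points=points, index=index)
    axis = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    order = np.argsort(points[:, axis])
    mid = len(points) // 2
    split = points[order[mid], axis]
    lo, hi = order[:mid], order[mid:]
    return Node(axis=axis, split=split,
                left=build(points[lo], index[lo]),
                right=build(points[hi], index[hi]))

def nearest(node, q, best=(np.inf, -1)):
    """Return (squared distance, index) of the nearest point, shrinking
    the search radius as closer points are found."""
    if node.axis is None:                 # bucket: test its points directly
        d2 = ((node.points - q) ** 2).sum(axis=1)
        j = int(np.argmin(d2))
        if d2[j] < best[0]:
            best = (float(d2[j]), int(node.index[j]))
        return best
    near, far = ((node.left, node.right) if q[node.axis] <= node.split
                 else (node.right, node.left))
    best = nearest(near, q, best)         # side containing the query first
    wall = q[node.axis] - node.split      # distance to the partition wall
    if wall * wall < best[0]:             # cross only if the ball overlaps
        best = nearest(far, q, best)
    return best

# usage
pts = np.random.rand(10_000, 10)
tree = build(pts, np.arange(len(pts)))
d2, i = nearest(tree, np.random.rand(10))
```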

Julian Mann
  • Actually, you don't really need to split along the longest dimension; one technique which works pretty well is to split on each dimension in turn, always in the same order. It's perhaps not the most efficient choice, but it's predictable, and that has some advantages too (see the sketch after these comments). – Matthieu M. May 26 '10 at 09:18
  • Thanks! I think I'll try that. I guess it will speed up the build, because figuring out which is the longest dimension is probably expensive? Even with the optimization in Bentley's paper, i.e. using a subset of the points to measure. – Julian Mann May 26 '10 at 09:44
  • Figuring out which dimension has the biggest spread is really expensive. Introducing an approximation (checking a subset of the points) gives you a considerable speedup (depending on the size of the subset of course). From my experience, it is entirely worth it. – PeterK May 26 '10 at 15:23
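The two axis-selection strategies from this comment thread, side by side (a sketch; points is the subset of points at the current node and depth is the recursion depth):

```python
import numpy as np

def split_axis_cyclic(depth, k):
    # cycle through the k dimensions in a fixed order: cheap and predictable
    return depth % k

def split_axis_spread(points, sample_size=100):
    # estimate which dimension has the largest spread using a random subset,
    # in the spirit of Bentley's optimization, instead of scanning every point
    n = min(sample_size, len(points))
    sample = points[np.random.choice(len(points), n, replace=False)]
    return int(np.argmax(sample.max(axis=0) - sample.min(axis=0)))
```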

(A year later) kd-trees that quit early, after looking at, say, 1M of all 200M points, can be much faster in high dimensions.
The results are only statistically close to the absolute nearest, depending on the data and the metric; there's no free lunch.
(Note that sampling 1M points, and building a kd-tree on only those 1M, is quite different, and worse.)

FLANN does this for image data with dim=128, and it is, I believe, in OpenCV. A local mod of the fast and solid SciPy cKDTree also has cutoff= .
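For instance, with FLANN's Python bindings (pyflann), the checks parameter bounds how many leaves each query may visit, which is exactly the quit-early trade-off described above. A hedged sketch with random stand-in data; treat the exact parameter names as indicative of the API rather than authoritative:

```python
import numpy as np
from pyflann import FLANN

data = np.random.rand(200_000, 10).astype(np.float32)    # "List2"
queries = np.random.rand(50_000, 10).astype(np.float32)  # "List1"

flann = FLANN()
# build a forest of randomized kd-trees over the data set
flann.build_index(data, algorithm="kdtree", trees=4)
# 'checks' caps how many leaves each query visits: smaller is faster,
# but the answer is only statistically close to the true nearest neighbour
idx, d2 = flann.nn_index(queries, num_neighbors=1, checks=128)
```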

denis