
I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).

Consider 4 points (A, B, C, D), and the following distances:

dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2

In this example, I don't need to evaluate dist(B,C) or dist(A,D): by the triangle inequality, dist(B,C) >= dist(A,C) - dist(A,B) = 4.5 and dist(A,D) >= dist(A,C) - dist(C,D) = 3, so both are guaranteed to be greater than the current known minimum distance of 0.5.
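A minimal sketch of this pruning rule (my own illustration, not an established algorithm), assuming 2-D Euclidean points are available whenever a distance does get evaluated. Note that the O(n²) cache it keeps is exactly the memory concern of point 3 below, and the scan over cached distances adds overhead of its own, so this shows the idea rather than a guaranteed speedup:

```java
// Closest pair with triangle-inequality pruning (illustrative sketch).
// A pair (i, j) is skipped when some third point k with known distances
// to both gives a lower bound |dist(k,i) - dist(k,j)| that already
// exceeds the best distance found so far.
class TrianglePruning {
    static double closestPair(double[][] pts) {
        int n = pts.length;
        double[][] cache = new double[n][n];    // O(n^2) memory: see point 3
        boolean[][] known = new boolean[n][n];
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean prunable = false;
                for (int k = 0; k < n && !prunable; k++) {
                    if (known[k][i] && known[k][j]
                            && Math.abs(cache[k][i] - cache[k][j]) > best) {
                        prunable = true;        // dist(i,j) cannot beat 'best'
                    }
                }
                if (prunable) continue;         // skip the distance evaluation
                double d = Math.hypot(pts[i][0] - pts[j][0],
                                      pts[i][1] - pts[j][1]);
                cache[i][j] = cache[j][i] = d;
                known[i][j] = known[j][i] = true;
                best = Math.min(best, d);
            }
        }
        return best;
    }
}
```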

  1. Is it possible to use this kind of information to reduce the O(n²) cost to something like O(n log n)?

  2. Is it possible to reduce the cost to something close to O(n log n) if I accept an approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a great approximation for small n.

  3. Processing time (measured in big-O notation) is not the only issue. Keeping a very large amount of previously calculated distances can also be an issue.

  4. Imagine this problem for a set with 10⁸ points.

What kind of solution should I look for? Has this kind of problem been solved before?

This is not a classroom problem or anything similar. I have just been thinking about this problem.

DanielTheRocketMan
  • Why is it that, given dist(A,B)=0.5, dist(A,C)=5 and dist(C,D)=2, you don't need to evaluate dist(B,C) and dist(A,D)? I think this is similar to the shortest path problem; when the relationship between these distances is not transitive, your statement is not correct. – Pham Trung Dec 27 '13 at 03:02
  • @PhamTrung Given `dist(A,B) = 0.5; dist(A,C) = 5`, the closest B and C can possibly be is if A, B, and C are collinear, and in that order. This leads to `dist(B,C) >= 4.5`, which is greater than the current minimum distance, so there's no need to evaluate it. – Trojan Dec 27 '13 at 03:08
  • @trojansdestroy, Yes! – DanielTheRocketMan Dec 27 '13 at 03:09
  • @DanielTheRocketMan For your very specific example it works, but in a case where dist(A,B) = 3, this method will not be sufficient. So the search space can be reduced, but not by much, I think. – Pham Trung Dec 27 '13 at 03:18
  • @PhamTrung That depends on the distribution of the 10^8 points. You could be right for very densely distributed points. Food for thought. – Trojan Dec 27 '13 at 03:24
  • @PhamTrung I agree with you that the quality of the solution may depend on the distribution of points! – DanielTheRocketMan Dec 27 '13 at 03:32
  • What's the dimension of the space? – Paul Hankin Dec 27 '13 at 09:33
  • I am primarily considering Dimension 2. – DanielTheRocketMan Dec 27 '13 at 10:16
  • Runtime will depend not only on the distribution of the points but also on the order in which you consider (pairs of) them, and this unfortunately means the worst-case runtime will be O(n^2), assuming that "Distance from a to b?" is the only query we are allowed. E.g. if you have a point at (0, 0), another point at (d, 0) (where d is some small value close to 0), and n-2 points distributed evenly around the circumference of a circle of radius r centred at (0, 0), with r large enough that d is smaller than the distance between any 2 points on the circle, then ... – j_random_hacker Dec 27 '13 at 11:56
  • ... the nearest neighbour pair will be (0, 0) and (d, 0) -- but nothing else can "tell" us to consider this pair of points early on, so in the worst case it will be considered last. Also in this worst case, we will need to consider O(n^2) other pairs first: each point on the circle with the point diagonally opposite, then with the points on either side of that point, eventually comparing it with its neighbours on either side. – j_random_hacker Dec 27 '13 at 12:00
  • (That's not a proof, or even a sketch, but I think it could be used as a starting point for one. In the case where dimension = number of points, it's much easier to establish the necessity of comparing O(n^2) point pairs in the worst case.) – j_random_hacker Dec 27 '13 at 12:02
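For reference, the adversarial configuration from the last three comments is easy to generate (a sketch; the parameters d and r are arbitrary, with d much smaller than the gap between neighbouring circle points):

```java
import java.util.ArrayList;
import java.util.List;

// Builds the worst case sketched above: the true closest pair sits near
// the origin, while the n-2 points on a large circle carry no hint of it.
class AdversarialCase {
    static List<double[]> build(int n, double d, double r) {
        List<double[]> pts = new ArrayList<>();
        pts.add(new double[]{0.0, 0.0});
        pts.add(new double[]{d, 0.0});          // the hidden closest pair
        for (int i = 0; i < n - 2; i++) {
            double theta = 2 * Math.PI * i / (n - 2);
            pts.add(new double[]{r * Math.cos(theta), r * Math.sin(theta)});
        }
        return pts;
    }
}
```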

3 Answers


If you only have sampled distances, not the original point locations in a plane you can operate on, then I suspect you are bounded at O(E), where E is the number of edges (distance pairs). Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out that it has something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.

Planar versions bypass O(V^2) by using planar distances to deduce limitations on sets of edges, allowing us to avoid looking at most of the edge weights.
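To make the O(E) baseline concrete, here is a minimal sketch, assuming the input arrives as a plain list of weighted edges (the `Edge` record is a hypothetical input type; requires Java 16+):

```java
import java.util.List;

// Baseline for the edges-only setting: one O(E) scan over all given
// distances. As argued above, every edge must be inspected at least once
// when no geometry is available, so this is essentially optimal here.
class EdgeScan {
    record Edge(int u, int v, double dist) {}

    static Edge closestEdge(List<Edge> edges) {
        Edge best = null;
        for (Edge e : edges) {
            if (best == null || e.dist() < best.dist()) best = e;
        }
        return best;
    }
}
```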

RichardPlunkett
  • Sorry @RichardPlunkett, I didn't understand your assertion very well. What exactly do you mean by edge? Are the edges the previously calculated distances? I also didn't understand the second assertion. – DanielTheRocketMan Dec 28 '13 at 13:06
  • What he is saying is that the O(n log n) solution applies only to planar cases. In the general case you have O(E) input. Unless you read the input completely (which takes O(E) time), you cannot get a solution. – ElKamina Jan 01 '14 at 04:35

I suggest using ideas derived from techniques for quickly solving k-nearest-neighbor searches.

The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".

Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.

My implementation is here: https://github.com/jon1van/MTreeMapRepo

Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.

I suggest using my code (or the idea behind it) to implement a solution in which you:

  1. Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
  2. Return the closest pair of Keys when considering only the "winner" of each HashMap.

This style of solution would be a "divide and conquer" approach that returns an approximate solution.
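To make the per-leaf search concrete, here is a hedged sketch of steps 1 and 2; it does not use the actual MTreeMapRepo API, and each leaf is represented as a plain list of 2-D points standing in for a leaf node's HashMap of Keys:

```java
import java.util.List;

// Approximate closest pair: brute-force each small leaf bucket to get its
// "winner", then return the best winner across leaves. It can miss the
// true answer when the two closest points land in different leaves.
class LeafwiseClosestPair {
    static double approxClosest(List<List<double[]>> leaves) {
        double best = Double.POSITIVE_INFINITY;
        for (List<double[]> leaf : leaves) {
            for (int i = 0; i < leaf.size(); i++) {
                for (int j = i + 1; j < leaf.size(); j++) {
                    double d = Math.hypot(leaf.get(i)[0] - leaf.get(j)[0],
                                          leaf.get(i)[1] - leaf.get(j)[1]);
                    best = Math.min(best, d);   // leaf-local winner
                }
            }
        }
        return best;    // best over all leaf winners (approximate answer)
    }
}
```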

You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.

Also, each HashMap is associated with a "radius". Depending on how accurate you want your result to be, you may be able to just search the HashMap with the largest hashMap.size()/radius ratio (because that HashMap contains the highest density of points, it is a good search candidate). Good luck.

Ivan

Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: the points that are closer to the first point and the points that are closer to the second. That is the same as splitting the points by a line passing between the two chosen points.

That produces a (binary) space partitioning, on which standard nearest-neighbour search algorithms can be used.
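A hedged sketch of that recursion (choosing the first two points as pivots is a simplification; random pivots would likely be more robust):

```java
import java.util.ArrayList;
import java.util.List;

// Recursive binary partition: pick two pivots, send each point to the side
// of the pivot it is closer to, and recurse. Small partitions are solved by
// brute force. The result is approximate, because the true closest pair can
// straddle a split.
class PivotPartition {
    static double closest(List<double[]> pts, int leafSize) {
        if (pts.size() <= leafSize) return bruteForce(pts);
        double[] p = pts.get(0), q = pts.get(1);          // pivot points
        List<double[]> left = new ArrayList<>();
        List<double[]> right = new ArrayList<>();
        for (double[] x : pts) {
            if (dist(x, p) <= dist(x, q)) left.add(x); else right.add(x);
        }
        // degenerate split (e.g. duplicate pivots): fall back to brute force
        if (left.isEmpty() || right.isEmpty()) return bruteForce(pts);
        return Math.min(closest(left, leafSize), closest(right, leafSize));
    }

    static double bruteForce(List<double[]> pts) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < pts.size(); i++)
            for (int j = i + 1; j < pts.size(); j++)
                best = Math.min(best, dist(pts.get(i), pts.get(j)));
        return best;
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}
```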

Ante