
The Closest Pair of Points Problem has been intriguing me lately. More specifically, the divide and conquer algorithm.

This recursive algorithm requires me to split a set of points into two chunks, a and b, compute the closest pair of points within each chunk, then compute the closest pair of points between the chunks, and return the minimum of these three quantities.

The algorithm that computes the closest pair between the chunks works by iterating over at most 7 points that lie at most δ = min(δa, δb) away from the vertical line l through the median element of the whole set (the last point in a).
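
To make the structure concrete, here is a hedged Python sketch of the whole recursive algorithm (all names are my own illustration; it assumes distinct points given as (x, y) tuples):

```python
from math import dist, inf

def closest_pair(points):
    """Return the smallest pairwise distance among `points` (assumed distinct)."""
    px = sorted(points)                        # sorted by x (lexicographically)
    py = sorted(px, key=lambda p: p[1])        # the same points sorted by y
    return _closest(px, py)

def _closest(px, py):
    n = len(px)
    if n <= 3:                                 # base case: brute force
        return min((dist(a, b) for i, a in enumerate(px) for b in px[i + 1:]),
                   default=inf)
    mid = n // 2
    mid_x = px[mid][0]                         # vertical line l through the median
    left, right = px[:mid], px[mid:]
    left_set = set(left)                       # relies on the points being distinct
    left_y = [p for p in py if p in left_set]  # y-order is preserved by filtering
    right_y = [p for p in py if p not in left_set]
    d = min(_closest(left, left_y), _closest(right, right_y))
    # the middle strip: points within d of the line l, in y order
    strip = [p for p in py if abs(p[0] - mid_x) < d]
    best = d
    for i, p in enumerate(strip):
        for q in strip[i + 1:]:
            if q[1] - p[1] >= d:               # at most 7 points can pass this test
                break
            best = min(best, dist(p, q))
    return best
```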

These two pictures represent the problem a little better; min(δa, δb) = δ.

[figures omitted: the point set split by the line l, and the 2δ-wide middle strip]

I understand that if I could have 2 points lying on the line l itself on each side, a and b, I'd need to compare at most 7 points in the middle strip.

What I am wondering, though, is this: if I construct the middle strip from points strictly less than δ away from l, couldn't I compare just the next 4 points instead of 7, since I can only fit 2 points on each side of l that are less than δ away from l and at least δ away from each other?

Edit: I have started a bounty for a very similar question on CS Stack Exchange, where we have had a very interesting discussion, so I'm linking it here. We've also started a very insightful chat here.

Danilo Souza Morães

1 Answer


Note: this answer was updated for future reference based on extensive discussions with @Danilo Souza Morães and @burnabyRails, who is the author of the accepted answer on a similar question Danilo asked on the CS site. The main problem with the original answer is that I assumed a slightly different algorithm was being used/discussed here. Since the original answer provided some important insights to Danilo, it is preserved as is at the end. Also, if you are going to read the discussion Danilo mentions in his question, be sure to read the Intro here first to better understand what I meant. The added preface discusses different versions of the algorithm in question.

Intro

This algorithm is based on two main ideas:

  1. Typical recursive divide and conquer approach
  2. The fact that if you already know the best distance in both the left and right areas, you can do the "combine" step in O(N) time, because you can prove that the check for each point takes O(1) time.

Nevertheless there are several ways to implement the same base ideas in practice, particularly the combining step. The main differences are:

  1. Do you treat "left" and "right" points independently and possibly in different ways?
  2. For each point, do you do a fixed number of checks, or do you filter the candidates first and only do distance checks for the filtered candidates? Note that you can do some filtering without breaking O(N*log(N)) time if you can ensure it is done in amortized O(1). In other words, if you know there is a strong upper bound on the number of good candidates, you don't have to check exactly that number of candidates.

The classical description of the algorithm in the CLRS Introduction to Algorithms clearly answers question #1: you mix the "left" and "right" points and treat them jointly. As for #2, the answer is less clear (to me), but it probably implies checking some fixed number of points. This seems to be the version that Danilo had in mind when asking the question.

The algorithm I had in mind differs on both points: it goes up through the "left" points and checks each of them against all the filtered "right" candidates. Obviously, I was not aware of this difference while writing my answer and during the initial discussion in the chat.

"My" algorithm

Here is a sketch of the algorithm I had in mind.

  1. The starting steps are common: we have already processed the "left" and "right" points and know the best distance δ so far, and we have them sorted along the Y axis. We have also filtered out all the points that do not lie in the ±δ strip.

  2. The outer loop goes up through the "left" points. Now assume we are processing one; let's call it "the left point". Additionally and semi-independently, we iterate over the "right" points, moving the "start position" when necessary (see step #3).

  3. For each left point, move up from the last start position in the right points and increment the start position until we get within δ (or rather -δ) difference from the left point along the Y axis. (Note: the separation of the "left" and "right" points is what makes it necessary to start at -δ.)

  4. Now, from that start position, continue going up and calculate the distance to every point whose Y difference from the current left point is less than +δ.

The filtering done in steps #3 and #4 is what makes it "data-dependent". The trade-off in this implementation is that you do fewer distance checks at the cost of more Y checks. Also, the code is arguably more complicated.

Why does this combining algorithm work in O(N)? For the same reason: there is a fixed bound (i.e. O(1)) on how many points you can fit into a δ×2δ rectangle such that the distance between any two of them is at least δ. It implies that O(1) checks are made at step #4. In practice, every distance checked by this algorithm will also be checked by the CLRS version, and depending on the data, the CLRS version might check some more.

The amortized cost of step #3 is also O(1), because the total cost of step #3 over the whole outer loop is O(n): you clearly can't move the start position up more times than there are "right" points in total.
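
For reference, the steps above can be sketched in Python roughly like this (an illustrative sketch under my naming assumptions; `left_strip` and `right_strip` are the filtered strip points of each half, sorted by Y, and `d` is the best distance so far):

```python
from math import dist

def combine(left_strip, right_strip, d):
    # left_strip / right_strip: points of each half lying within d of the
    # dividing line, both sorted by Y; d is the best distance found so far.
    best = d
    start = 0                                   # "start position" in right_strip
    for p in left_strip:                        # outer loop: go up the left points
        # step #3: advance the start position while the right points are
        # more than d below the current left point along the Y axis
        while start < len(right_strip) and right_strip[start][1] <= p[1] - d:
            start += 1
        # step #4: from the start position, go up while the Y difference
        # is below +d, computing the actual distances
        j = start
        while j < len(right_strip) and right_strip[j][1] < p[1] + d:
            best = min(best, dist(p, right_strip[j]))
            j += 1
    return best
```

Note that the start position only ever moves forward, which is exactly the amortized-O(1) argument above.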

Modified CLRS algorithm

You can easily make a different decision on question #2 even in the CLRS version of the algorithm. The description of the crucial step says:

  1. For each point p in the array Y′, the algorithm tries to find points in Y′ that are within δ units of p. As we shall see shortly, only the 7 points in Y′ that follow p need be considered. The algorithm computes the distance from p to each of these 7 points and keeps track of the closest-pair distance δ′ found over all pairs of points in Y′

It is easy to modify it to not blindly check 7 points, but first check whether the difference in Y is less than δ. Obviously, you are still guaranteed to need to check no more than 7 points. Again, the trade-off is that you do fewer distance checks but do some Y-difference checks. Depending on your data and on the relative performance of different operations on your hardware, it might be a good or a bad trade-off.
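
This modification might look roughly like this in Python (an illustrative sketch; `strip` is assumed to be the Y-sorted list of points within δ of the dividing line):

```python
from math import dist

def strip_closest(strip, d):
    # strip: points within d of the dividing line, sorted by Y.
    # Instead of blindly checking the 7 following points, break out as soon
    # as the Y difference reaches d; the packing argument still guarantees
    # at most 7 inner iterations per point.
    best = d
    for i, p in enumerate(strip):
        for q in strip[i + 1:]:
            if q[1] - p[1] >= d:    # Y filter: no later point can be within d
                break
            best = min(best, dist(p, q))
    return best
```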

Some other important ideas

  1. If you don't need to find all the pairs at the minimal distance, you can safely use `<` instead of `<=` while filtering.

  2. In the world of real hardware with floating-point representation of numbers, the idea of exact equality is not that clear. Usually you need to check something like abs(a - b) < ε anyway.

  3. The idea behind my counter-example (to a different algorithm): you don't have to put all the points on the edges. An equilateral triangle with side δ can fit into a square with side δ, since its height is (√3/2)·δ (which is just another way of saying (√3/2)·δ < δ).


Original Answer

I don't think that you correctly understand the meaning of the constant 7 in this algorithm. In practice you don't check 4 or 7 or any fixed number of points; you go up (or down) along the Y axis and check the ones that fall into the corresponding rectangle. There might easily be 0 such points. What is really important for the algorithm to work in the promised O(n*log(n)) time is that there is a fixed upper bound on the number of points checked at that step. Any constant upper bound would work as long as you can prove it. In other words, it is important that the step is O(n), not the particular constant; 7 is just a bound that is relatively easy to prove, and that's all.

I believe the practical reason for 7 is that on real hardware you can't make exact computations of distances in floating-point data types; you're bound to have some rounding errors. That's why using < instead of <= is not really practical.

And finally, I don't think that 4 is a correct bound even assuming you can reliably do <. Let's assume δ = 1. Let the "left" point be (-0.0001; 0), so the "right" <δ rectangle for it is 0 <= x < 1 and -1 < y < 1. Consider the following 5 points (the idea: 4 points almost in the corners, placed just to fit the <δ rectangle while keeping the distances between them > δ, and the 5th at the center of the rectangle):

  • P1 = (0.001; 0.999)
  • P2 = (0.999; 0.93)
  • P3 = (0.001; -0.999)
  • P4 = (0.999; -0.93)
  • P5 = (0.5; 0)

Note that while those 5 points are more than δ apart from one another, some of them might be less than δ from the "left" point. This is why we check them in the first place.

Distance P1-P2 (and symmetrically P3-P4) is

sqrt(0.998^2 + 0.069^2) = sqrt(0.996004 + 0.004761) = sqrt(1.000765) > 1

Distance P1-P5 (and symmetrically P3-P5) is

sqrt(0.499^2 + 0.999^2) = sqrt(0.249001 + 0.998001) = sqrt(1.247002) > 1

Distance P2-P5 (and symmetrically P4-P5) is

sqrt(0.499^2 + 0.93^2) = sqrt(0.249001 + 0.8649) = sqrt(1.113901) > 1

So you can fit 5 points into such a <δ rectangle, which is clearly more than 4.
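
The counter-example can also be verified mechanically; here is a short Python check with δ = 1 (variable names are my own):

```python
from math import dist
from itertools import combinations

# The five points above, with delta = 1 and the "left" point at (-0.0001, 0)
pts = [(0.001, 0.999), (0.999, 0.93), (0.001, -0.999),
       (0.999, -0.93), (0.5, 0.0)]

# all five fit into the half-open rectangle 0 <= x < 1, -1 < y < 1 ...
assert all(0 <= x < 1 and -1 < y < 1 for x, y in pts)

# ... yet every pairwise distance strictly exceeds delta = 1
assert all(dist(a, b) > 1 for a, b in combinations(pts, 2))
```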

SergGr
  • I'm not sure how you drew the conclusion that I don't understand the algorithm. I'm not arguing that 4 comparisons would render an asymptotically faster algorithm. I'm just asking whether the constant 7, as presented on Wikipedia, could be trimmed down to 4 GIVEN that that's the maximum number of points that would fit in a rectangle of sides less than δ. I presented all of these facts in the question to show that I fully understand how the algorithm works, but I am looking for a way to prove that this minor improvement I presented actually works. – Danilo Souza Morães Jan 12 '19 at 02:17
  • @DaniloSouzaMorães, OK, I might have misunderstood you, sorry. Still, I don't like the fact that you call it an "improvement", because it doesn't really affect anything. Anyway, do you see any issues with my counter-example with 5 points at the end of my answer? IMHO it clearly disproves the whole idea of trimming down to 4. – SergGr Jan 12 '19 at 11:04
  • Your answer is exactly what I was looking for. I haven't accepted it before because I spent quite some time studying and plotting it. I also got a good answer on the other question I asked and so I was just making some remarks there before moving to SO to accept yours – Danilo Souza Morães Jan 12 '19 at 11:07
  • And you are right, it isn't an improvement, because it doesn't work. My assumption was that I can fit at most 2 points δ away from each other and less than δ away from l. I thought that was the case because if I were to fit 3 points in the corners of a square of side δ and then start moving the 2 points on one edge closer to the 3rd point, that 3rd point would be forced out of the square to keep the minimum distance of δ. But that is not the case, as you have shown with your set of points. The 3rd point could move towards the center of the edge, ultimately forming an equilateral triangle. – Danilo Souza Morães Jan 12 '19 at 11:13
  • @DaniloSouzaMorães, I don't like calling it an "improvement" not because it is false, but because even if it were true, it would affect neither the actual performance of the algorithm nor the theoretical bound of `O(N*log(N))`. – SergGr Jan 12 '19 at 11:15
  • I could very well build 2 algorithms that perform the same task with the same asymptotic complexity but where the execution times differ by millennia. If it were true that I could solve the problem with only 4 iterations instead of 7 and each iteration took a day to complete, then I would've saved 3 days of computation. I use quick sort instead of merge sort because quick sort usually runs faster in memory, even though it has the same time complexity and almost the same number of constant steps as merge sort. Quick sort is superior/improved when compared to merge sort in that sense. – Danilo Souza Morães Jan 12 '19 at 11:32
  • Would you agree with that definition of "improvement"? – Danilo Souza Morães Jan 12 '19 at 11:34
  • @DaniloSouzaMorães, then we still disagree on the relevance of the first half of my answer. I still insist that in practice you don't check 4 or 7 points, for every "left" point you check all the points fitting in the corresponding "right rectangle". The proof of a better bound on the number of such points doesn't affect the actual number of them and thus doesn't affect the actual performance. The actual number is data-dependent but proof-independent. – SergGr Jan 12 '19 at 11:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186592/discussion-between-serggr-and-danilo-souza-moraes). – SergGr Jan 12 '19 at 11:37
  • @burnabyRails, sorry, I'm not sure I understand you. Points in the middle strip are ordered by what? Which one of my 5 points will it ignore? – SergGr Jan 12 '19 at 14:49
  • The points with negative y-coordinates when the current point is (something, 0). – burnabyRails Jan 12 '19 at 14:57
  • @burnabyRails, sorry, I still don't get the logic. Why can you ignore them? AFAICS they can as easily be closer than `δ` as the ones with positive `Y` (the situation is clearly absolutely symmetric). – SergGr Jan 12 '19 at 15:00
  • @burnabyRails, there is already a [chat](https://chat.stackoverflow.com/rooms/186592/discussion-between-serggr-and-danilo-souza-moraes) attached to this answer, we can try to continue our discussion there. – SergGr Jan 12 '19 at 15:01