7

I have several points stored in an array. I need to find the bounds of those points, i.e. the rectangle that bounds all of them. I know how to solve this in plain Python.

I would like to know whether there is a better way than the naive max/min over the array, or a built-in method, to solve the problem.

points = [[1, 3], [2, 4], [4, 1], [3, 3], [1, 6]]
b = bounds(points) # the function I am looking for
# now b = [[1, 1], [4, 6]]
Reblochon Masque
ryanafrish7
  • 2
    Share how you would solve it in Python? We could try to improve upon it. How about : `np.min(points,0) and np.max(points,0)`? – Divakar Sep 21 '17 at 04:27
  • 1
    Unless your data points have some kind of ordering already, you can't do better than O(n). So you may as well use naive min and max approach. – wim Sep 21 '17 at 04:33
  • @Divakar That helped – ryanafrish7 Sep 21 '17 at 04:33

3 Answers

31

My approach to getting performance is to push things down to C level whenever possible:

def bounding_box(points):
    x_coordinates, y_coordinates = zip(*points)

    return [(min(x_coordinates), min(y_coordinates)), (max(x_coordinates), max(y_coordinates))]

By my (crude) measure, this runs about 1.5 times faster than @ReblochonMasque's bounding_box_naive(). It is also clearly more elegant. ;-)
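As a quick check, here is the zip-based function applied to the points from the question. This is only a usage sketch; note that it returns tuples rather than the nested lists shown in the question.

```python
def bounding_box(points):
    # Transpose the list of (x, y) pairs into one tuple of xs and one of ys.
    x_coordinates, y_coordinates = zip(*points)
    return [(min(x_coordinates), min(y_coordinates)),
            (max(x_coordinates), max(y_coordinates))]


points = [[1, 3], [2, 4], [4, 1], [3, 3], [1, 6]]
print(bounding_box(points))  # [(1, 1), (4, 6)]
```

An empty list of points raises a ValueError at the unpacking step, so callers that may pass an empty collection would need to guard against that.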

cdlane
5

You cannot do better than O(n), because you must traverse all the points to determine the max and min for x and y.

However, you could try to reduce the constant factor by traversing the list only once. It is unclear whether that gives a better execution time, and if it does, it would only be for large collections of points.

[EDIT]: in fact it does not; the "naive" approach is the more efficient one.

Here is the "naive" approach (it is the faster of the two):

def bounding_box_naive(points):
    """returns a list containing the bottom left and the top right 
    points in the sequence
    Here, we use min and max four times over the collection of points
    """
    bot_left_x = min(point[0] for point in points)
    bot_left_y = min(point[1] for point in points)
    top_right_x = max(point[0] for point in points)
    top_right_y = max(point[1] for point in points)

    return [(bot_left_x, bot_left_y), (top_right_x, top_right_y)]

and the (maybe?) less naive:

def bounding_box(points):
    """returns a list containing the bottom left and the top right 
    points in the sequence
    Here, we traverse the collection of points only once, 
    to find the min and max for x and y
    """
    bot_left_x, bot_left_y = float('inf'), float('inf')
    top_right_x, top_right_y = float('-inf'), float('-inf')
    for x, y in points:
        bot_left_x = min(bot_left_x, x)
        bot_left_y = min(bot_left_y, y)
        top_right_x = max(top_right_x, x)
        top_right_y = max(top_right_y, y)

    return [(bot_left_x, bot_left_y), (top_right_x, top_right_y)]

profiling results:

import random
points = [(random.randrange(-1000, 1000), random.randrange(-1000, 1000))  for _ in range(1000000)]

%timeit bounding_box_naive(points)
%timeit bounding_box(points)

size = 1,000 points

1000 loops, best of 3: 573 µs per loop
1000 loops, best of 3: 1.46 ms per loop

size = 10,000 points

100 loops, best of 3: 5.7 ms per loop
100 loops, best of 3: 14.7 ms per loop

size = 100,000 points

10 loops, best of 3: 66.8 ms per loop
10 loops, best of 3: 141 ms per loop

size = 1,000,000 points

1 loop, best of 3: 664 ms per loop
1 loop, best of 3: 1.47 s per loop

Clearly, the first ("naive") approach is faster, by a factor of roughly 2 to 2.5.
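The timings above use IPython's %timeit magic. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper best_time and the name bounding_box_single_pass are made up for this illustration, the sample size is kept small so it runs quickly, and the numbers will differ from machine to machine.

```python
import random
import timeit


def bounding_box_naive(points):
    # Four separate passes, each driven by the built-in min/max.
    return [(min(p[0] for p in points), min(p[1] for p in points)),
            (max(p[0] for p in points), max(p[1] for p in points))]


def bounding_box_single_pass(points):
    # One pass, four two-argument min/max calls per point.
    min_x = min_y = float('inf')
    max_x = max_y = float('-inf')
    for x, y in points:
        min_x, min_y = min(min_x, x), min(min_y, y)
        max_x, max_y = max(max_x, x), max(max_y, y)
    return [(min_x, min_y), (max_x, max_y)]


def best_time(func, points, number=10, repeat=3):
    # Hypothetical helper: best average time per call, in seconds.
    timings = timeit.repeat(lambda: func(points), number=number, repeat=repeat)
    return min(timings) / number


points = [(random.randrange(-1000, 1000), random.randrange(-1000, 1000))
          for _ in range(10_000)]
for func in (bounding_box_naive, bounding_box_single_pass):
    print(f"{func.__name__}: {best_time(func, points) * 1e3:.2f} ms per call")
```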

Reblochon Masque
  • +1, but I am curious how the performance of an inline ternary statement compares to a two-element `min` call -- or, just an `if: (update assignment)` in the case it's larger/smaller – jedwards Sep 21 '17 at 05:02
  • 2
    4 loops and 1 comparison inside each loop vs 1 loop and 4 comparisons inside the loop. I think it's just "moving work around". If you really want speed, you should be looking at a numba JIT or something like that. – wim Sep 21 '17 at 05:13
  • hehehe, that was my guess too, but after your comment, I had to go back to it and measure it. Thanks for pushing me @wim. (the results are posted above) – Reblochon Masque Sep 21 '17 at 05:26
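For reference, the variant jedwards asks about (plain comparisons instead of two-element min/max calls) might look like the sketch below. The name bounding_box_if is hypothetical and no timing is claimed for it here; it would need to be measured alongside the other two.

```python
def bounding_box_if(points):
    """Single pass over the points, using plain comparisons
    instead of two-argument min()/max() calls."""
    min_x = min_y = float('inf')
    max_x = max_y = float('-inf')
    for x, y in points:
        # Separate ifs (not elif): the first point must update both bounds.
        if x < min_x:
            min_x = x
        if x > max_x:
            max_x = x
        if y < min_y:
            min_y = y
        if y > max_y:
            max_y = y
    return [(min_x, min_y), (max_x, max_y)]


print(bounding_box_if([[1, 3], [2, 4], [4, 1], [3, 3], [1, 6]]))  # [(1, 1), (4, 6)]
```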
0

You can extract the bounding box faster by using numpy, particularly if there is some additional benefit in converting your points to an array anyway.

import numpy as np

def bounding_box_numpy(points: np.ndarray):
    """
    Find min/max from an N-collection of coordinate pairs, shape = (N, 2), using
    numpy's min/max along the collection-axis.
    Returns a flat list [min_x, min_y, max_x, max_y].
    """
    return [*points.min(axis=0), *points.max(axis=0)]


import random
points = [(random.randrange(-1000, 1000), random.randrange(-1000, 1000))  for _ in range(1000000)]
numpy_points = np.array(points)  # see the comment in the end *)
print(numpy_points.shape)  # prints (1000000, 2)

Then (see the earlier answer https://stackoverflow.com/a/46335659/10980510 by @Reblochon Masque)

%timeit bounding_box_naive(points)
%timeit bounding_box(points)
%timeit bounding_box_numpy(numpy_points)

gives the following profiling results:

136 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
274 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20.7 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

*) In all fairness, however, converting the list of point pairs into a numpy array takes hundreds of milliseconds.
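To make that conversion cost explicit, a wrapper that starts from a plain Python list could be sketched as follows. The name bounding_box_from_list is hypothetical; it returns the nested-list shape requested in the question, and the np.asarray call is where the conversion time is spent.

```python
import numpy as np


def bounding_box_from_list(points):
    # Converting the list to an array is the expensive step for large inputs.
    arr = np.asarray(points)
    # min/max along axis 0 give [min_x, min_y] and [max_x, max_y].
    return [arr.min(axis=0).tolist(), arr.max(axis=0).tolist()]


points = [[1, 3], [2, 4], [4, 1], [3, 3], [1, 6]]
print(bounding_box_from_list(points))  # [[1, 1], [4, 6]]
```

Timing this wrapper, rather than the array-only function, gives a fairer comparison with the pure-Python versions when the data starts out as a list.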

mjkvaak