
I've got millions of geographic points. For each one of these, I want to find all "neighboring points," i.e., all other points within some radius, say a few hundred meters.

There is a naive O(N^2) solution to this problem: simply calculate the distance between all pairs of points. However, because I'm dealing with a proper distance metric (geographic distance), there should be a quicker way to do this.

I would like to do this in Python. One solution that comes to mind is to use some database (MySQL with GIS extensions, PostGIS) and hope that such a database would take care of efficiently performing the operation described above using some index. I would prefer something simpler though, that doesn't require me to build and learn about such technologies.

A couple of points:

  • I will perform the "find neighbors" operation millions of times
  • The data will remain static
  • Because the problem is in a sense simple, I'd like to see the Python code that solves it.

Put in terms of python code, I want something along the lines of:

points = [(lat1, long1), (lat2, long2) ... ] # this list contains millions of lat/long tuples
points_index = magical_indexer(points)
neighbors = []
for point in points:
    point_neighbors = points_index.get_points_within(point, 200) # get all points within 200 meters of point
    neighbors.append(point_neighbors) 
conradlee
  • Do you need to perform this operation multiple times (so that it might be useful to do some (hard) work once and do a simpler computation each time you need some point)? Do you need to fetch neighbours for multiple points, or is the "center" the same point every time? – Martin Thurau Jun 16 '11 at 11:41
  • I want to perform this query millions of times. In fact, I want to find the neighbors of each and every point. – conradlee Jun 16 '11 at 11:47
  • Will the geographic points be static for the duration of the application, or will they be different every time the query is executed? – Patrick Jun 16 '11 at 11:51
  • I'm not making sense of why using a GIST index over your points in Postgres is not desirable. Surely you don't want to be recalculating the neighbors of each and every last million point all the time? – Denis de Bernardy Jun 16 '11 at 11:53
  • The points will remain static. As you mention, a GIS index in Postgres would probably do the trick, but I don't know my way around Postgres and would prefer a simpler solution that doesn't require me to learn and build other technologies. – conradlee Jun 16 '11 at 12:01

2 Answers


SciPy

First things first: there are preexisting data structures for this kind of thing, such as the k-d tree. SciPy has a fast implementation, cKDTree, that can find all points within a given range.

Binary Search

Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:

  1. Compute the PCA of the dataset. You want to rotate the dataset such that the most significant direction is first, and the orthogonal (less large) second direction is, well, second. You can skip this and just choose X or Y, but it's computationally cheap and usually easy to implement. If you do just choose X or Y, choose the direction with greater variance.
  2. Sort the points by the major direction (call this direction X).
  3. To find the nearest neighbor of a given point, find the index of the point nearest in X by binary search (if the point is already in your collection, you may already know this index and don't need the search). Iteratively look to the next and previous points, maintaining the best match so far and its distance from your search point. You can stop looking when the difference in X is greater than or equal to the distance to the best match so far (in practice, usually very few points).
  4. To find all points within a given range, do the same as step 3, except don't stop until the difference in X exceeds the range.

Effectively, you're doing O(N log(N)) preprocessing, and roughly O(sqrt(N)) work per query - or more, if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor will be on the order of the square root of N. It's less efficient if many points are within your range, but never much worse than brute force.

One advantage of this method is that it's all executable in very few memory allocations, and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.
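The steps above (minus the PCA rotation, i.e., just sorting on X) can be sketched in plain Python. Function names here are made up for illustration, and the sketch uses planar Euclidean distance rather than geographic distance:

```python
import bisect
import math

def build_index(points):
    """Sort points by x so we can binary-search on that axis."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    return pts, xs

def points_within(index, query, radius):
    """Return all points within `radius` of `query` (Euclidean distance)."""
    pts, xs = index
    qx = query[0]
    found = []
    start = bisect.bisect_left(xs, qx)
    # Scan right until the x-difference alone exceeds the radius.
    i = start
    while i < len(pts) and pts[i][0] - qx <= radius:
        if math.dist(pts[i], query) <= radius:
            found.append(pts[i])
        i += 1
    # Scan left the same way.
    i = start - 1
    while i >= 0 and qx - pts[i][0] <= radius:
        if math.dist(pts[i], query) <= radius:
            found.append(pts[i])
        i -= 1
    return found
```

The index is built once (the O(N log N) step); each `points_within` call then only touches points whose X coordinate is within the radius of the query's.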

Delaunay triangulation

Another idea: a Delaunay triangulation could work. For the Delaunay triangulation, it's given that any point's nearest neighbor is an adjacent node. The intuition is that during a search, you can maintain a heap (priority queue) based on absolute distance from the query point. Pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points like this, but you'd need to look at it more carefully to be sure...
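A rough sketch of that idea using scipy.spatial.Delaunay (the function name is hypothetical, `pts` is assumed to be a NumPy array, and the caveat above stands: this walk can miss in-range points whose only Delaunay paths run through out-of-range vertices):

```python
import heapq
import numpy as np
from scipy.spatial import Delaunay

def delaunay_range_search(pts, start, radius):
    """Best-first walk over the Delaunay adjacency graph from pts[start]."""
    tri = Delaunay(pts)
    # CSR-style adjacency: neighbors of vertex i are adj[indptr[i]:indptr[i+1]]
    indptr, adj = tri.vertex_neighbor_vertices

    def dist(i):
        return float(np.linalg.norm(pts[i] - pts[start]))

    heap = [(0.0, start)]
    seen = {start}
    in_range = []
    while heap:
        d, i = heapq.heappop(heap)
        if d > radius:
            break  # heap is ordered by distance, so everything left is farther
        in_range.append(i)
        for j in adj[indptr[i]:indptr[i + 1]]:
            j = int(j)
            if j not in seen:
                seen.add(j)
                heapq.heappush(heap, (dist(j), j))
    return in_range
```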

Eamon Nerbonne
  • I think you're on the track to the right solution with your suggestion of k-d trees. I'm now taking a look at scipy.spatial.cKDTree which seems to be an implementation that will be simple to use in a straightforward, pythonic manner. – conradlee Jun 16 '11 at 12:26
  • Yep, that looks like a fine implementation, it even supplies a range-limiting query parameter! - link added to answer. – Eamon Nerbonne Jun 16 '11 at 12:33
  • I would accept your answer if you include a bit of python code that uses that implementation to solve the problem... – conradlee Jun 16 '11 at 12:42

Tipped off by Eamon, I've come up with a simple solution using the k-d tree implementation in SciPy.

from scipy.spatial import cKDTree
from numpy import inf # scipy.inf was removed in newer SciPy versions

max_distance = 0.0001 # Assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters
points = [(lat1, long1), (lat2, long2) ... ]
tree = cKDTree(points)

point_neighbors_list = [] # Put the neighbors of each point here

for point in points:
    distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
    point_neighbors = []
    for index, distance in zip(indices, distances):
        if distance == inf:
            break
        point_neighbors.append(points[index])
    point_neighbors_list.append(point_neighbors)
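In newer SciPy versions, cKDTree can do the whole all-pairs query in one call via query_ball_tree (or query_pairs), which avoids asking for k=len(points) neighbors per query. A minimal sketch with made-up sample coordinates:

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up sample coordinates; in the real problem this would be the
# full array of lat/long pairs (still in degrees, same caveats as above).
points = np.array([[0.0, 0.0], [0.00005, 0.0], [1.0, 1.0]])
max_distance = 0.0001

tree = cKDTree(points)
# For every point, get the indices of all points within max_distance;
# each point appears in its own neighbor list.
point_neighbors_list = tree.query_ball_tree(tree, max_distance)
```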
conradlee
  • Hi, @conradlee. How did you figure out this measure of distance? I mean, if I'd like to use 2 km, for example, how would I figure out the value of max_distance? Thanks. – pceccon Apr 14 '16 at 14:53
  • The meters equivalent of a lat/long degree actually depends on the latitude and is different for lats and lons, so this is only a rough conversion. However, you can come up with a useful hack for your data, such as: at 40° north or south, one degree of longitude spans about 85 km. – Aaron Bramson Feb 22 '19 at 07:18
  • However, because of the differences in degree-to-meters conversion for lat and lon, and changes at different latitudes, this solution is only approximate. Unfortunately, there doesn't seem to be a way to use a custom distance function (like haversine) in the scipy or sklearn KDTree implementations. – Aaron Bramson Feb 22 '19 at 07:21