Python nearest neighbour - coordinates

Question

I wanted to check I was using scipy's KD tree correctly because it appears slower than a simple bruteforce.

I had three questions regarding this:

Q1.

If I create the following test data:

nplen = 1000000
# WGS84 lat/long
point = [51.349,-0.19]
# This contains WGS84 lat/long
points = np.ndarray.tolist(np.column_stack(
        [np.round(np.random.randn(nplen)+51,5),
         np.round(np.random.randn(nplen),5)]))

And create three functions:

def kd_test(points,point):
    """ KD Tree"""
    return points[spatial.KDTree(points).query(point)[1]]

def ckd_test(points,point):
    """ C implementation of KD Tree"""
    return points[spatial.cKDTree(points).query(point)[1]]

def closest_math(points,point):
    """ Simple angle"""
    return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))[1:3]

I would expect the cKD tree to be the fastest, however - running this:

print("Co-ordinate: ", f(points,point))
print("Index: ", points.index(list(f(points,point))))
%timeit f(points,point)

Result times - the simple bruteforce method is faster:

closest_math: 1 loops, best of 3: 3.59 s per loop
ckd_test: 1 loops, best of 3: 13.5 s per loop
kd_test: 1 loops, best of 3: 30.9 s per loop

Is this because I am using it wrong - somehow?

Q2.

I would assume that the even to get the ranking (rather than distance) of closest points one still needs to project the data. However, it seems that the projected and un-projected points give me the same nearest neighbour:

def proj_list(points,
              inproj = Proj(init='epsg:4326'),
              outproj = Proj(init='epsg:27700')):
    """ Projected geo coordinates"""
    return [list(transform(inproj,outproj,x,y)) for y,x in points]
proj_points = proj_list(points)
proj_point = proj_list([point])[0]

Is this just because my spread of points is not big enough to introduce distortion? I re-ran a few times and still got the same index out of the projected and un-projected lists being returned.

Q3.

Is it generally faster to project the points (like above) and calculate the hypotenuse distance compared to calculating the haversine or vincenty distance on (un-projected) latitude/longitudes? Also which option would be more accurate? I ran a small test:

from math import *
def haversine(origin,
              destination):
    """
    Find distance between a pair of lat/lng coordinates
    """
    lat1, lon1, lat2, lon2 = map(radians, [origin[0],origin[1],destination[0],destination[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371000  # Metres
    return (c * r)

def closest_math_unproj(points,point):
    """ Haversine on unprojected """
    return (min((haversine(point,pt),pt[0],pt[1]) for pt in points))

def closest_math_proj(points,point):
    """ Simple angle since projected"""
    return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))

Results:

So this seems to say that projecting and then doing distance is faster than not - however, I am not sure which method will bring more accurate results.

Testing this on an online vincenty calculation is seems the projected co-ordinates are the way to go:

One mostly unrelated suggestion: using `%timeit -n 10 f(points,point)` is probably more convenient than doing `%timeit for x in range(10): f(points,point)`. — Martin Valgur, Feb 05 '16 at 15:42
It might be worth taking a look at https://github.com/storpipfugl/pykdtree, by the way. This will probably not solve the efficiency issues vs the brute force approach, but will probably be a bit faster than the scipy's default implementation. — Martin Valgur, Feb 05 '16 at 16:02

score 1 · Accepted Answer · answered Feb 05 '16 at 17:14

Q1.

The reason for the apparent inefficiency of the k-d tree is quite simple: you are measuring both the construction and querying of the k-d tree at once. This is not how you would or should use a k-d tree: you should construct it only once. If you measure only the querying, the time taken reduces to mere tens of milliseconds (vs seconds using the brute-force approach).

Q2.

This will depend on the spatial distribution of the actual data being used and the projection being used. There might be slight differences based on how efficient the implementation of the k-d tree is at balancing the constructed tree. If you are querying only a single point, then the result will be deterministic and unaffected by the distribution of points anyway.

With the sample data that you are using, which has strong central symmetry, and with your map projection (Transverese Mercator), the difference should be negligible.

Q3.

Technically, the answer to your question is trivial: using the Haversine formula for geographic distance measurement is both more accurate and slower. Whether the tradeoff between accuracy and speed is warranted depends heavily on your use case and the spatial distribution of your data (mostly on the spatial extent, obviously).

If the spatial extent of your points is on the small, regional side, then using a suitable projection and the simple Euclidean distance measure might be accurate enough for your use case and faster than using the Haversine formula.

Thanks Martin - this answers everything. I just wanted to check that you said the Haversine formula would be more accurate (and so by extension the vincenty formula). Which implies if accuracy is very important then a vectorised numpy vincenty formula is the way to go? — mptevsion, Feb 05 '16 at 17:17
Ah, sorry. I misread the last question and missed that you were asking about haversine or vincenty formulas. You can diregard my last answer. The last question is probably much better suited for gis.stackexchange.com than SO as such. — Martin Valgur, Feb 05 '16 at 17:21
Sorry what I'm getting at is - if I have (e.g) 10 million co-ordinates in the UK and my main goal is to minimise the error in distances (+- 1 metre is great) then I should be using scipy.pdist with a vectorised vincenty formula as opposed to projecting the co-ordinates and then running a vectorised euclidean distance? — mptevsion, Feb 05 '16 at 17:21

Python nearest neighbour - coordinates

1 Answers1

Q1.

Q2.

Q3.