I wanted to check I was using scipy's KD tree correctly because it appears slower than a simple bruteforce.
I had three questions regarding this:
Q1.
If I create the following test data:
nplen = 1000000
# WGS84 lat/long
point = [51.349,-0.19]
# This contains WGS84 lat/long
points = np.ndarray.tolist(np.column_stack(
[np.round(np.random.randn(nplen)+51,5),
np.round(np.random.randn(nplen),5)]))
And create three functions:
def kd_test(points,point):
""" KD Tree"""
return points[spatial.KDTree(points).query(point)[1]]
def ckd_test(points,point):
""" C implementation of KD Tree"""
return points[spatial.cKDTree(points).query(point)[1]]
def closest_math(points,point):
""" Simple angle"""
return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))[1:3]
I would expect the cKD tree to be the fastest, however - running this:
print("Co-ordinate: ", f(points,point))
print("Index: ", points.index(list(f(points,point))))
%timeit f(points,point)
Result times - the simple bruteforce method is faster:
closest_math: 1 loops, best of 3: 3.59 s per loop
ckd_test: 1 loops, best of 3: 13.5 s per loop
kd_test: 1 loops, best of 3: 30.9 s per loop
Is this because I am using it wrong - somehow?
Q2.
I would assume that the even to get the ranking (rather than distance) of closest points one still needs to project the data. However, it seems that the projected and un-projected points give me the same nearest neighbour:
def proj_list(points,
inproj = Proj(init='epsg:4326'),
outproj = Proj(init='epsg:27700')):
""" Projected geo coordinates"""
return [list(transform(inproj,outproj,x,y)) for y,x in points]
proj_points = proj_list(points)
proj_point = proj_list([point])[0]
Is this just because my spread of points is not big enough to introduce distortion? I re-ran a few times and still got the same index out of the projected and un-projected lists being returned.
Q3.
Is it generally faster to project the points (like above) and calculate the hypotenuse distance compared to calculating the haversine or vincenty distance on (un-projected) latitude/longitudes? Also which option would be more accurate? I ran a small test:
from math import *
def haversine(origin,
destination):
"""
Find distance between a pair of lat/lng coordinates
"""
lat1, lon1, lat2, lon2 = map(radians, [origin[0],origin[1],destination[0],destination[1]])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
c = 2 * asin(sqrt(a))
r = 6371000 # Metres
return (c * r)
def closest_math_unproj(points,point):
""" Haversine on unprojected """
return (min((haversine(point,pt),pt[0],pt[1]) for pt in points))
def closest_math_proj(points,point):
""" Simple angle since projected"""
return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))
Results:
So this seems to say that projecting and then doing distance is faster than not - however, I am not sure which method will bring more accurate results.
Testing this on an online vincenty calculation is seems the projected co-ordinates are the way to go: