how do I return all points pairs that satisfy a distance threshold criteria using haversine distance matrix

Question

I have 39,803 different lat lon points (here are the first 5 points)

lat lon
0   27.269987   -82.497004
1   27.537598   -82.508422
2   27.337793   -82.533753
3   27.497719   -82.570261
4   27.512062   -82.722158

That I've calculated the haversine distance matrix for

array([[  0.        ,  29.77832527,   8.36846516, ...,  30.07072193,
      7.60700598,   7.63477669],
   [ 29.77832527,   0.        ,  22.35749757, ...,   2.6836159 ,
     22.17639199,  23.07090099],
   [  8.36846516,  22.35749757,   0.        , ...,  23.07825172,
      3.10333262,   0.75441483],
   ..., 
   [ 30.07072193,   2.6836159 ,  23.07825172, ...,   0.        ,
     22.53766911,  23.75965211],
   [  7.60700598,  22.17639199,   3.10333262, ...,  22.53766911,
      0.        ,   3.03795035],
   [  7.63477669,  23.07090099,   0.75441483, ...,  23.75965211,
      3.03795035,   0.        ]])

thus my matrix is 39,803 by 39,803. I would like to figure out a way to find all pairs that had a distance of less than 25 meters. For instance, if we consider the first array

[  0.        ,  29.77832527,   8.36846516, ...,  30.07072193,
  7.60700598,   7.63477669]

the pair

(lat[0],lon[0])-(lat[2],lon[2])=(27.269987,-82.497004)-(27.337793,-82.533753) = 8.36846516

satisfies this criteria but

(lat[0],lon[0])-(lat[1],lon[1])=(27.269987,-82.497004)-(27.537598,-82.508422) = 29.77832527

does not. I'd like to get a subset of points that satisfy this criteria. This is what I have so far:

X=df[['lat','lon']].dropna(axis=0) 
coors=np.array(X)

from math import radians, cos, sin, asin, sqrt

from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

import matplotlib.pyplot as plt
import pandas as pd


def haversine(lonlat1, lonlat2):

    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

I'd greatly appreciate any help on how I could potentially compute this

rwp · Answer 1 · 2018-06-07T19:24:33.847

How about using the numpy.argwhere() function? This can be used to find the row/column index of entries in a matrix that satisfy a certain criterion. In your case, this might be something like:

numpy.argwhere(distances < 25)

That will return an Nx2 array giving row/column indices of the N points which are separated by a distance of less than 25m. Obviously, you may need some additional filtering to avoid the duplication associated with your symmetric distance matrix, and eliminating the zero distance between each point and itself. Perhaps you could handle that by taking your distance matrix and performing the following transformation:

distances = distances + 26 * (numpy.tril(numpy.ones_like(distances), k=1))

which forces all elements on the diagonal and in the upper triangle of your distance matrix to have a value above your 25m threshold.

In your particular case, because you're implicitly using a for-loop over all pairs of points to compute your distance matrix, you may not gain much overall from using the numpy routines to do the final stages of filtering. Instead, you might consider something more like:

indices = [ (i, j) for i in range(Npoints)
                     for j in range(i+1, Npoints)
                     if haversine(points[i], points[j]) < minDistance ]

That approach, while not as efficient as a pure-numpy method, should be less memory hungry when applied to large numbers of points.

A hybrid approach might involve taking subsets of your list of points, forming distance matrices between each pair of subsets, and filtering them using the above numpy.argwhere() method.

hi @rwp, thank you for your comment, unfortunately I get a memory error — M3105, Jun 07 '18 at 18:36
Does that mean that your distance matrix only just fitted in memory, so that the additional working storage needed for filtering is too great? Can you confirm that the basic idea is correct for a subset of your data? — rwp, Jun 07 '18 at 18:56

how do I return all points pairs that satisfy a distance threshold criteria using haversine distance matrix

1 Answers1