Find the distance between 2 series of points in Pandas, Fastest Iteration

Question

I have two sets of data. The first, called locations, contains the coordinates of fixed locations.

The second is a table of vehicle movements called movements.

What would be the fastest way to iterate through both tables to find if any of the movements are within a certain distance of a location, e.g. the Euclidean distance between a point on the movements and a point on any of the locations?

I am currently using a nested loop, which is incredibly slow. Both pandas DataFrames have been converted to lists of records using

locations_dict=locations.to_dict('records')
movements_dict=movements.to_dict('records')

then iterated via:

for movement in movements_dict:
    visit='no visit'
    for location in locations_dict:
        # Euclidean distance in raw degrees between the movement and this location
        distance = np.sqrt((location['Latitude']-movement['Lat'])**2+(location['Longitude']-movement['Lng'])**2)
        if distance < 0.05:
            # stop at the first location within the threshold
            visit=location['Location']
            break
    movement['distance']=distance
    movement['visit']=visit

Is there any way to make this faster? The main issue is that this operation is a Cartesian product, so any added rows will increase its cost significantly.

You can't measure distances on a sphere like that. Create a [geodataframe](https://geopandas.org/en/stable/docs/reference/api/geopandas.points_from_xy.html) and reproject it to a coordinate system suitable for the location, for example the appropriate UTM zone — BERA, Sep 11 '22 at 08:27
A good example showing that this method produces bogus results is computing the distance between Tokyo and Rio de Janeiro. The vector will pass straight through the Earth and be about 1.57 times shorter (PI/2) than the real-world shortest possible path (the geodesic). Even worse: in Antarctica or in the Pacific, the missing modulus can give completely unrealistic distances (such as twice the Tokyo-Rio distance when the true distance is 1000 times smaller than that). The Earth is not flat, so naive Euclidean distances do not work ;) . — Jérôme Richard, Sep 11 '22 at 09:53
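
To illustrate the point raised in these comments, here is a minimal sketch of a great-circle (haversine) distance in NumPy. It is not the reprojection approach suggested above; the function name `haversine_km` and the mean Earth radius constant are assumptions chosen for illustration, and the spherical model is itself only an approximation.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude at the equator versus at 60 degrees north.
d_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
d_60n = haversine_km(60.0, 0.0, 60.0, 1.0)

# Two points on either side of the 180 degree meridian, also one degree apart.
d_dateline = haversine_km(0.0, 179.5, 0.0, -179.5)

print(d_equator, d_60n, d_dateline)
```

One degree of longitude is roughly 111 km at the equator but only about half of that at 60 degrees north, so a fixed threshold such as 0.05 in raw degrees corresponds to different ground distances depending on latitude. The function accepts NumPy arrays as well as scalars, and the pair straddling the 180 degree meridian comes out one degree apart rather than 359.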

Claudio · Answer 1 · 2022-09-11T02:26:33.867

You can export the pandas data directly to NumPy, for example like this:

loc_lat=locations['Latitude' ].to_numpy()
loc_lon=locations['Longitude'].to_numpy()
mov_lat=movements['Lat'      ].to_numpy()
mov_lon=movements['Lng'      ].to_numpy()

From here on there is no need for loops to obtain the results, because NumPy operates on entire arrays at once. This should give a large speedup over looping in Python over dictionary values.

Check out the following code example, which shows how to get an array with all pairs from two arrays:

import numpy as np
a = np.array([1,2,3])
b = np.array([4,5])
print( np.transpose([np.tile(a, len(b)), np.repeat(b,len(a))]) )
gives_as_print = """
[[1 4]
 [2 4]
 [3 4]
 [1 5]
 [2 5]
 [3 5]]"""
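
Building on the idea of working with all pairs at once, the following is a minimal sketch of how the nested loop from the question could be expressed with NumPy broadcasting. The sample data is hypothetical; only the column names and the 0.05 threshold come from the question. It keeps the question's Euclidean distance in degrees, so the caveats from the comments still apply.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question.
locations = pd.DataFrame({
    "Location": ["Depot", "Warehouse"],
    "Latitude": [10.00, 20.00],
    "Longitude": [50.00, 60.00],
})
movements = pd.DataFrame({
    "Lat": [10.01, 15.00, 20.02],
    "Lng": [50.02, 55.00, 59.99],
})

loc_lat = locations["Latitude"].to_numpy()
loc_lon = locations["Longitude"].to_numpy()
mov_lat = movements["Lat"].to_numpy()
mov_lon = movements["Lng"].to_numpy()

# Broadcasting (n_movements, 1) against (n_locations,) gives an
# (n_movements, n_locations) matrix of pairwise distances in degrees.
dist = np.hypot(mov_lat[:, None] - loc_lat, mov_lon[:, None] - loc_lon)

threshold = 0.05  # same cutoff as in the question
within = dist < threshold

# argmax returns the first True in each row, mirroring the `break` in the loop.
first_hit = within.argmax(axis=1)
has_hit = within.any(axis=1)

movements["visit"] = np.where(
    has_hit, locations["Location"].to_numpy()[first_hit], "no visit"
)
# Distance to the matched location, or NaN when nothing is within range.
movements["distance"] = np.where(
    has_hit, dist[np.arange(len(movements)), first_hit], np.nan
)
print(movements)
```

Unlike the original loop, this sketch stores NaN as the distance when no location is within range, rather than the distance to the last location checked. The full distance matrix needs memory proportional to the number of movements times the number of locations, so for large tables it may be necessary to process the movements in chunks or to use a spatial index such as `scipy.spatial.cKDTree` instead.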
