I have a dataframe with >2.7MM coordinates, and a separate list of ~2,000 coordinates. For each row in the dataframe, I'm trying to return the minimum distance between that row's coordinates and any coordinate in the list. The following code works on a small scale (a dataframe with 200 rows), but over 2.7MM rows it seemingly runs forever.
from haversine import haversine
df
Latitude Longitude
39.989 -89.980
39.923 -89.901
39.990 -89.987
39.884 -89.943
39.030 -89.931
end_coords_list = [(41.342,-90.423),(40.349,-91.394),(38.928,-89.323)]
def min_distance(row):
    # Distance (km, haversine's default unit) from this row's point to the nearest point in end_coords_list
    beg_coord = (row.Latitude, row.Longitude)
    return min(haversine(beg_coord, end_coord) for end_coord in end_coords_list)

df['Min_Distance'] = df.apply(min_distance, axis=1)
I know the issue lies in the sheer number of calculations that are happening (2.7MM * 2,000 = ~5.4BN), and the fact that running this many Python-level loops is incredibly inefficient.
Based on my research, it seems like a vectorized NumPy function might be a better approach, but I'm new to Python and NumPy so I'm not quite sure how to implement this in this particular situation.
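For concreteness, here is a minimal sketch of the kind of vectorized approach I mean. It is not something I have tested at full scale. The function name `min_haversine_km`, the `chunk_size` parameter, and the Earth radius constant are my own assumptions (I believe the radius matches the haversine package's default, but I have not verified that). It broadcasts a chunk of rows against all end coordinates at once and takes the row-wise minimum.

```python
import numpy as np
import pandas as pd

# Assumed mean Earth radius in km (hypothetical constant for this sketch)
EARTH_RADIUS_KM = 6371.0088


def min_haversine_km(lat, lon, end_coords, chunk_size=5_000):
    """For each (lat, lon) in degrees, return the haversine distance in km
    to the nearest point in end_coords (a sequence of (lat, lon) pairs)."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    ends = np.radians(np.asarray(end_coords, dtype=float))
    end_lat, end_lon = ends[:, 0], ends[:, 1]
    cos_end_lat = np.cos(end_lat)

    out = np.empty(lat.shape[0])
    # Work in chunks so the (chunk, n_ends) temporaries stay small
    for start in range(0, lat.shape[0], chunk_size):
        stop = start + chunk_size
        lat_c = lat[start:stop, None]  # column vector, shape (chunk, 1)
        lon_c = lon[start:stop, None]
        dlat = end_lat - lat_c         # broadcasts to (chunk, n_ends)
        dlon = end_lon - lon_c
        a = np.sin(dlat / 2) ** 2 + np.cos(lat_c) * cos_end_lat * np.sin(dlon / 2) ** 2
        # Distance increases with `a`, so reduce to the minimum before arcsin
        a_min = np.clip(a.min(axis=1), 0.0, 1.0)
        out[start:stop] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a_min))
    return out


df = pd.DataFrame({
    "Latitude": [39.989, 39.923, 39.990, 39.884, 39.030],
    "Longitude": [-89.980, -89.901, -89.987, -89.943, -89.931],
})
end_coords_list = [(41.342, -90.423), (40.349, -91.394), (38.928, -89.323)]

df["Min_Distance"] = min_haversine_km(df["Latitude"], df["Longitude"], end_coords_list)
```

The chunking is there because a full 2.7MM x 2,000 array of float64 values would need roughly 43 GB, whereas a 5,000 x 2,000 chunk is about 80 MB per temporary array.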
Ideal Output:
df
Latitude Longitude Min_Distance
39.989 -89.980 3.7
39.923 -89.901 4.1
39.990 -89.987 4.2
39.884 -89.943 5.9
39.030 -89.931 3.1
Thanks in advance!