Problem summary:
I have two dataframes. The first dataframe (df1) is relatively small (nearly always under 100 observations, typically under 50), with a set of point identifiers and their lat/lon coordinates. The second dataframe (df2) is very large (hundreds of thousands of observations) and also has lat/lon coordinates. I wish to create two new columns in df2: the first with the identifier of the nearest point from df1, the second with the distance to that point. My current method is quite clunky and I think it can be optimized significantly. For additional context, there is a single df1 (small dataframe), but I will be repeating this process for multiple df2s (large dataframes).
Setup/sample data:
# imports:
import pandas as pd
import geopy.distance
from faker import Faker
# creating sample data:
Faker.seed(0)
fake = Faker()
id1 = []
lat1 = []
lon1 = []
id2 = []
lat2 = []
lon2 = []
length1 = 10   # length of df1
length2 = 100  # length of df2
for x in range(length1):
    a = fake.local_latlng()
    id1.append(x)
    lat1.append(float(a[0]))
    lon1.append(float(a[1]))
for x in range(length2):
    a = fake.local_latlng()
    id2.append(x)
    lat2.append(float(a[0]))
    lon2.append(float(a[1]))
dict1 = {
    'loc_id': id1,
    'lat': lat1,
    'lon': lon1,
}
dict2 = {
    'point_id': id2,
    'lat': lat2,
    'lon': lon2,
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
Current solution:
# calculating distances:
for x in range(len(df1)):
    loc_id = df1['loc_id'].iloc[x]
    pt1 = (df1['lat'].iloc[x], df1['lon'].iloc[x])
    for y in range(len(df2)):
        pt2 = (df2['lat'].iloc[y], df2['lon'].iloc[y])
        dist = geopy.distance.distance(pt1, pt2).miles
        # one temporary distance column per df1 point, keyed by its loc_id
        df2.loc[y, loc_id] = dist
# determining minimum distance and label:
temp_cols = list(df1['loc_id'])
df2['min_dist'] = df2[temp_cols].min(axis=1)
df2['min_loc'] = df2[temp_cols].idxmin(axis=1)
# removing extra columns:
df2 = df2.drop(temp_cols, axis=1)
print(df2.head())
Possible solutions:
This code is quite obviously slow, since I compute the distance for every pair of points: len(df1) × len(df2) calls to geopy.distance.distance(), each followed by a row-by-row .loc write. Conceptually, I think this can be improved, but I'm having trouble implementing the improvements. Some ideas:
- vectorize operations. This accepted answer seems to indicate that operations over whole vectors are faster than row-by-row loops, but I don't know how to apply the geopy.distance.distance() function over a vector (or if that's even possible). A vectorized sketch follows this list.
- eliminate "dominated" points. If, relative to a given df2 point, one df1 candidate is farther away in both lat and lon than another candidate, the first can never be the nearest, so it could be dropped from the set I have to check against. I imagine this adds work/processing on the front end, but pays off in the end by reducing the number of points I check for each point. Figuring out that algorithm is not obvious to me, though.
- I might be able to do some sort of binning of points into groups that are near one another, giving smaller candidate sets to compare. Maybe it's even possible to work out which point is closest before calculating any distances. The danger is that some of the points in df1 might be quite close together as well. A spatial-index sketch that automates this kind of binning follows the additional details below.
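For the first idea, here is a minimal vectorized sketch of what I have in mind. Since I don't see a way to vectorize geopy.distance.distance() itself, it swaps in the haversine formula, which treats the Earth as a sphere (radius taken as 3958.8 miles here), so distances will differ slightly from geopy's geodesic results. The helper name nearest_haversine is just for illustration:

# a rough vectorized sketch (haversine instead of geodesic):
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch

def nearest_haversine(df1, df2):
    # convert degrees to radians for the trig functions
    lat1 = np.radians(df1['lat'].to_numpy())
    lon1 = np.radians(df1['lon'].to_numpy())
    lat2 = np.radians(df2['lat'].to_numpy())
    lon2 = np.radians(df2['lon'].to_numpy())
    # broadcast to a (len(df2), len(df1)) matrix of pairwise terms
    dlat = lat2[:, None] - lat1[None, :]
    dlon = lon2[:, None] - lon1[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat2)[:, None] * np.cos(lat1)[None, :] * np.sin(dlon / 2) ** 2)
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))
    # index of the nearest df1 point for every df2 row
    idx = dist.argmin(axis=1)
    out = df2.copy()
    out['min_loc'] = df1['loc_id'].to_numpy()[idx]
    out['min_dist'] = dist[np.arange(len(df2)), idx]
    return out

df2 = nearest_haversine(df1, df2)
print(df2.head())

This does build the full len(df2) × len(df1) distance matrix in memory; with df1 under 100 points and df2 in the hundreds of thousands, that's on the order of tens of millions of floats, so it might need chunking over df2 if memory is tight.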
Additional details: The odds of two points having identical distances are small, and I am happy randomly selecting any point that ties for closest should that come up.
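On the binning idea: a spatial index is essentially automated binning, and scikit-learn's BallTree supports a haversine metric directly. Here is a sketch assuming scikit-learn is installed (again spherical-Earth distances rather than geopy's geodesic ones, and not something I've benchmarked). Ties are broken arbitrarily by the query, which is fine per the note above:

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch

# BallTree with metric='haversine' expects [lat, lon] pairs in radians;
# build it once from the small df1 ...
tree = BallTree(np.radians(df1[['lat', 'lon']].to_numpy()), metric='haversine')

# ... then query it with each large df2 (k=1 -> single nearest neighbour)
dist, idx = tree.query(np.radians(df2[['lat', 'lon']].to_numpy()), k=1)
df2['min_loc'] = df1['loc_id'].to_numpy()[idx[:, 0]]
df2['min_dist'] = dist[:, 0] * EARTH_RADIUS_MILES  # haversine returns radians
print(df2.head())

Since there is a single df1, the tree would be built once and then queried against every df2, which seems to suit my one-df1/many-df2s setup.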