I have two separate datasets,df
and df2
, each dataset has the columns longitude
and latitude
. What I am trying to do is find the point in df
that is closest to the point in df2
, and calculate the distance between them in km
and append each value to a new column in df2
.
I've come up with a solution, but keep in mind df
has +700,000
rows and df2
has about 60,000
rows, so my solution will take way too long to compute. The only solution I could come up with is using a double for
loop...
def compute_shortest_dist(df, df2):
# array to store all closest distances
shortest_dist = []
# radius of earth (used for calculation)
R = 6373.0
for i in df2.index:
# keeps track of current minimum distance
min_dist = -1
# latitude and longitude from df2
lat1 = df2.ix[i]['Latitude']
lon1 = df2.ix[i]['Longitude']
for j in df.index:
# the following is just the calculation necessary
# to calculate the distance between each point in km
lat2 = df.ix[j]['Latitude']
lon2 = df.ix[j]['Longitude']
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance = R * c
# store new shortest distance
if min_dist == -1 or distance > min_dist:
min_dist = distance
# append shortest distance to array
shortest_dist.append(min_dist)
This function takes way too long to compute and I know there must be a more efficient way but I am not very good at pandas
syntax.
I appreciate any help.