
I have two separate datasets, df and df2, and each has Latitude and Longitude columns. What I am trying to do is find the point in df that is closest to each point in df2, calculate the distance between them in km, and append each value to a new column in df2.

I've come up with a solution, but keep in mind that df has over 700,000 rows and df2 has about 60,000 rows, so my solution takes far too long to compute. The only approach I could come up with is a double for loop...

from math import sin, cos, sqrt, atan2

def compute_shortest_dist(df, df2):
    # list to store the closest distance for each point in df2
    shortest_dist = []

    # radius of the earth in km (used for the haversine calculation)
    R = 6373.0
    for i in df2.index:
        # keeps track of the current minimum distance (-1 means "not set yet")
        min_dist = -1

        # latitude and longitude from df2 (assumed to already be in radians)
        lat1 = df2.ix[i]['Latitude']
        lon1 = df2.ix[i]['Longitude']

        for j in df.index:

            # haversine formula: distance between the two points in km
            lat2 = df.ix[j]['Latitude']
            lon2 = df.ix[j]['Longitude']
            dlon = lon2 - lon1
            dlat = lat2 - lat1
            a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
            c = 2 * atan2(sqrt(a), sqrt(1 - a))
            distance = R * c

            # store new shortest distance
            if min_dist == -1 or distance < min_dist:
                min_dist = distance
        # append shortest distance to the list
        shortest_dist.append(min_dist)
    return shortest_dist

This function takes way too long to compute, and I know there must be a more efficient way, but I am not very good with pandas syntax.

I appreciate any help.

buydadip
  • Well, it has to loop 42 billion times, so I don't think there's a really efficient way to calculate that. You could try to use a module to parallelize the process to make it two or four times as fast, but I can't think of a better way – Shinra tensei Feb 09 '18 at 08:50
  • The calculation of `numpy.ndarray` should be faster than that of `pandas.core.series.Series`. You may try adding `.values` when you extract the latitude and longitude (a sketch of this is shown after these comments). – pe-perry Feb 09 '18 at 08:57
  • Possibly related: https://stackoverflow.com/q/7129482/1679849 – r3mainer Feb 09 '18 at 09:52
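A minimal sketch of the `.values` idea from pe-perry's comment, assuming the 'Latitude' and 'Longitude' column names used in the code above: pulling the coordinates out of each DataFrame once as NumPy arrays avoids going through pandas Series machinery on every access inside the loops. The variable names are hypothetical.

# extract the coordinates once as plain NumPy arrays
lats = df['Latitude'].values      # numpy.ndarray instead of a pandas Series
lons = df['Longitude'].values
lats2 = df2['Latitude'].values
lons2 = df2['Longitude'].values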

1 Answer


You can write the inner loop in numpy, which should speed it up significantly:

import numpy as np

def compute_shortest_dist(df, df2):
    # array to store all closest distances
    shortest_dist = []

    # radius of earth (used for calculation)
    R = 6373.0
    lat1 = df['Latitude']
    lon1 = df['Longitude']
    for i in df2.index:
        # haversine distance (in km) from this df2 point to every point in df
        lat2 = df2.loc[i, 'Latitude']
        dlat = lat1 - lat2
        dlon = lon1 - df2.loc[i, 'Longitude']
        a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
        distance = 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

        # append shortest distance to array
        shortest_dist.append(distance.min())
    return shortest_dist
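
To get the new column the question asks for, the returned list can be assigned straight back onto df2. A minimal usage sketch, with a column name chosen only for illustration; it assumes the coordinates are already in radians, as the haversine formula requires (otherwise convert them first with np.radians):

# append the closest distance (km) for each df2 point as a new column
df2['closest_dist_km'] = compute_shortest_dist(df, df2)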
Graipher
  • don't use `ix` because it is [deprecated](http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated), better is `df2.loc[i, 'Longitude']` – jezrael Feb 09 '18 at 09:11