
I have a dataframe with >2.7MM coordinates, and a separate list of ~2,000 coordinates. I'm trying to return the minimum distance between the coordinates in each individual row compared to every coordinate in the list. The following code works on a small scale (dataframe with 200 rows), but when calculating over 2.7MM rows, it seemingly runs forever.

from haversine import haversine

df
Latitude   Longitude
39.989    -89.980
39.923    -89.901
39.990    -89.987
39.884    -89.943
39.030    -89.931

end_coords_list = [(41.342,-90.423),(40.349,-91.394),(38.928,-89.323)]

for row in df.itertuples():
    def min_distance(row):
        beg_coord = (row.Latitude, row.Longitude)
        return min(haversine(beg_coord, end_coord) for end_coord in end_coords_list)
    df['Min_Distance'] = df.apply(min_distance, axis=1)

I know the issue lies in the sheer number of calculations that are happening (2.7MM * 2,000 = ~5.4BN), and the fact that running this many Python-level loops is incredibly inefficient.

Based on my research, it seems like a vectorized NumPy function might be a better approach, but I'm new to Python and NumPy so I'm not quite sure how to implement this in this particular situation.

Ideal Output:

df
Latitude   Longitude  Min_Distance
39.989    -89.980     3.7
39.923    -89.901     4.1
39.990    -89.987     4.2
39.884    -89.943     5.9
39.030    -89.931     3.1

Thanks in advance!

  • Tell us about this `haversine`. What inputs does it accept? True `vectorization` usually requires reducing the calculation down to basic math that `numpy` handles in compiled code. We can't `vectorize` a blackbox. – hpaulj Jun 21 '17 at 16:49
  • `haversine` accepts two inputs: the "beginning" coordinates, and the "ending" coordinates and calculates the distance (in kilometers) between the two. – Walt Reed Jun 21 '17 at 16:56
  • Is that from [`here`](https://github.com/mapado/haversine/blob/master/haversine/__init__.py)? If so, please link up in the question. – Divakar Jun 21 '17 at 16:56
  • Just updated it. Let me know if this provides the clarity you were looking for. – Walt Reed Jun 21 '17 at 17:00
  • I can't give an answer right now. This is O(n^2). Write the loop with numba. You'll have to look it up or wait for an answer. – piRSquared Jun 21 '17 at 17:00
  • Still not clear from where have you picked up `haversine`. – Divakar Jun 21 '17 at 17:01
  • Not sure what you mean. It's a package I installed.. – Walt Reed Jun 21 '17 at 17:04
  • And we need info on the source code of that package. Again, posting to confirm if this is the link - https://github.com/mapado/haversine/blob/master/haversine/__init__.py? Don't assume that we would have installed all packages at our end. – Divakar Jun 21 '17 at 17:06
  • My apologies, you're correct- that's the package I'm using. – Walt Reed Jun 21 '17 at 17:09
  • A quick glance at that package shows that it works with scalars, using `math.sin` etc. But it also looks like it would be easy to rewrite the calculation using `numpy.sin` etc. It would be a good `numpy` programming exercise to do that yourself. – hpaulj Jun 21 '17 at 17:31

1 Answer


The haversine function is, in essence:

# the math-module pieces the haversine package relies on
from math import radians, sin, cos, asin, sqrt

AVG_EARTH_RADIUS = 6371  # mean Earth radius in km, as used by the package

# convert all latitudes/longitudes from decimal degrees to radians
lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))

# calculate haversine
lat = lat2 - lat1
lng = lng2 - lng1

d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))

Here's a vectorized method that leverages NumPy broadcasting and NumPy ufuncs in place of those math-module funcs, so that we operate on entire arrays in one go -

import numpy as np

# Get array data; convert to radians to simulate the 'map(radians, ...)' part.
# coords_list holds the end coordinates (end_coords_list in the question).
coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df.values)

# Get the differentiations
lat = coords_arr[:,0] - a[:,0,None]
lng = coords_arr[:,1] - a[:,1,None]

# Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
# Add into the "sin(lat * 0.5) ** 2" part.
add0 = np.cos(a[:,0,None]) * np.cos(coords_arr[:,0]) * np.sin(lng * 0.5) ** 2
d = np.sin(lat * 0.5) ** 2 + add0

# Get h and assign into dataframe
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
df['Min_Distance'] = h.min(1)

For a further performance boost, we can make use of the numexpr module to replace the transcendental funcs, as sketched below.
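
A minimal sketch of that idea, assuming the numexpr package is available; the function name vectorized_app_numexpr and the avg_earth_radius keyword are illustrative additions, not part of the original answer -

import numexpr as ne
import numpy as np

def vectorized_app_numexpr(df, coords_list, avg_earth_radius=6371.0):
    # Same broadcasting setup as above, but the transcendental math is
    # evaluated by numexpr in compiled, multi-threaded code.
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df[['Latitude', 'Longitude']].values)

    lat1 = a[:, 0, None]        # start latitudes, shape (n, 1)
    lat2 = coords_arr[:, 0]     # end latitudes, shape (m,)
    lat = lat2 - lat1
    lng = coords_arr[:, 1] - a[:, 1, None]

    # numexpr picks up lat, lng, lat1, lat2 and the scalar from this scope
    d = ne.evaluate('sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2')
    h = ne.evaluate('2 * avg_earth_radius * arcsin(sqrt(d))')
    return h.min(1)

Usage would mirror the NumPy version: df['Min_Distance'] = vectorized_app_numexpr(df, coords_list).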


Runtime test and verification

Approaches -

def loopy_app(df, coords_list):
    for row in df.itertuples():
        def min_distance(row):
            beg_coord = (row.Latitude, row.Longitude)
            return min(haversine(beg_coord, end_coord) for end_coord in coords_list)
        df['Min_Distance1'] = df.apply(min_distance, axis=1)

def vectorized_app(df, coords_list):   
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df.values)

    lat = coords_arr[:,0] - a[:,0,None]
    lng = coords_arr[:,1] - a[:,1,None]

    add0 = np.cos(a[:,0,None]) * np.cos(coords_arr[:,0]) * np.sin(lng * 0.5) ** 2
    d = np.sin(lat * 0.5) ** 2 + add0

    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    df['Min_Distance2'] = h.min(1)

Verification -

In [158]: df
Out[158]: 
   Latitude  Longitude
0    39.989    -89.980
1    39.923    -89.901
2    39.990    -89.987
3    39.884    -89.943
4    39.030    -89.931

In [159]: loopy_app(df, coords_list)

In [160]: vectorized_app(df, coords_list)

In [161]: df
Out[161]: 
   Latitude  Longitude  Min_Distance1  Min_Distance2
0    39.989    -89.980     126.637607     126.637607
1    39.923    -89.901     121.266241     121.266241
2    39.990    -89.987     126.037388     126.037388
3    39.884    -89.943     118.901195     118.901195
4    39.030    -89.931      53.765506      53.765506

Timings -

In [163]: df
Out[163]: 
   Latitude  Longitude
0    39.989    -89.980
1    39.923    -89.901
2    39.990    -89.987
3    39.884    -89.943
4    39.030    -89.931

In [164]: %timeit loopy_app(df, coords_list)
100 loops, best of 3: 2.41 ms per loop

In [165]: %timeit vectorized_app(df, coords_list)
10000 loops, best of 3: 96.8 µs per loop
  • That is freaking brilliant. Thanks for demonstrating how to use NumPy with Pandas. I get a memory error when running on a very large data frame. Do you think `numexpr` could solve that? – Walt Reed Jun 21 '17 at 19:35
  • @WaltReed No, `numexpr` won't help there. Just divide the dataframe into chunks: grab, say, `10000` rows at a time, process with the proposed code, assign into the output col, then move on to the next `10000` rows, and so on (a sketch of this follows below). – Divakar Jun 21 '17 at 19:40
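
A minimal sketch of that chunking idea; the helper name min_distances, the CHUNK constant, and the avg_earth_radius default are illustrative assumptions, with the per-chunk math taken from the answer above -

import numpy as np

CHUNK = 10000  # rows per chunk, per the comment above

def min_distances(chunk_df, coords_list, avg_earth_radius=6371.0):
    # Broadcasting approach from the answer, applied to one chunk of rows
    # at a time so the (rows x coords) intermediate array stays small.
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(chunk_df[['Latitude', 'Longitude']].values)
    lat = coords_arr[:, 0] - a[:, 0, None]
    lng = coords_arr[:, 1] - a[:, 1, None]
    d = np.sin(lat * 0.5) ** 2 + np.cos(a[:, 0, None]) * np.cos(coords_arr[:, 0]) * np.sin(lng * 0.5) ** 2
    return (2 * avg_earth_radius * np.arcsin(np.sqrt(d))).min(1)

out = np.empty(len(df))
for start in range(0, len(df), CHUNK):
    chunk = df.iloc[start:start + CHUNK]
    out[start:start + CHUNK] = min_distances(chunk, coords_list)
df['Min_Distance'] = out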