0

I have two data frame. One is user id with lat lon data and other is store code with store lat lon data. Around 89M rows are there. I want nearest (based on min.distance) store code corresponding user lat lon.

df1 - 

id          user_lat       user_lon
1           13.031885      80.235574
2           19.099819      72.915288
3           22.226980      84.836070

df2 - 

store_no       s_lat        s_lon
22             29.91         73.88
23             28.57         77.33
24             26.86         80.95

I have done so far -

from geopy.distance import vincenty
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

df1 = df1[['user_lat','user_lon']]

df2 = df2[['s_lat','s_lon']]

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k',1)

x.head(20)

    user_lat    user_lon    s_lat   s_lon
0   13.031885   80.235574   29.91   73.88
1   13.031885   80.235574   28.57   77.33
2   13.031885   80.235574   26.86   80.95
3   19.099819   72.915288   29.91   73.88
4   19.099819   72.915288   28.57   77.33
5   19.099819   72.915288   26.86   80.95
6   22.226980   84.836070   29.91   73.88
7   22.226980   84.836070   28.57   77.33
8   22.226980   84.836070   26.86   80.95

x['dist'] = np.ravel(dist.pairwise(np.radians(store_lat_lon),np.radians(user_lat_lon)) * 6367)

   user_lat     user_lon    s_lat   s_lon    dist
0   13.031885   80.235574   29.91   73.88   1986.237557
1   13.031885   80.235574   28.57   77.33   1205.217610
2   13.031885   80.235574   26.86   80.95   1386.069611
3   19.099819   72.915288   29.91   73.88   1752.628427
4   19.099819   72.915288   28.57   77.33   1143.731258
5   19.099819   72.915288   26.86   80.95   1031.246453
6   22.226980   84.836070   29.91   73.88   1538.449674
7   22.226980   84.836070   28.57   77.33   1190.620278
8   22.226980   84.836070   26.86   80.95   647.477461

But I want data frame looks like -

    user_lat    user_lon    s_lat   s_lon    dist         store_no
0   13.031885   80.235574   29.91   73.88   1986.237557     23
1   13.031885   80.235574   28.57   77.33   1205.217610     23
2   13.031885   80.235574   26.86   80.95   1386.069611     23
3   19.099819   72.915288   29.91   73.88   1752.628427     24
4   19.099819   72.915288   28.57   77.33   1143.731258     24
5   19.099819   72.915288   26.86   80.95   1031.246453     24
6   22.226980   84.836070   29.91   73.88   1538.449674     24
7   22.226980   84.836070   28.57   77.33   1190.620278     24
8   22.226980   84.836070   26.86   80.95   647.477461      24
Pallavi Verma
  • 85
  • 3
  • 12

1 Answers1

2

Finding the nearest store of each user is a classic use case for either the k-d tree or ball tree data structures. Scikit-learn implements both, but only the BallTree accepts the haversine distance metric, so we'll use that.

import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree, DistanceMetric

# Set up example data
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'user_lat': [13.031885, 19.099819, 22.22698],
                    'user_lon': [80.235574, 72.915288, 84.83607]})

df2 = pd.DataFrame({'store_no': [22, 23, 24],
                    's_lat': [29.91, 28.57, 26.86],
                    's_lon': [73.88, 77.33, 80.95]})

# Build k-d tree with haversine distance metric, which expects
# (lat, lon) in radians and returns distances in radians
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(np.radians(df2[['s_lat', 's_lon']]), metric=dist)

coords = np.radians(df1[['user_lat', 'user_lon']])
dists, ilocs = tree.query(coords)
# dists is in rad; convert to km
df1['dist'] = dists.flatten() * 6367
df1['nearest_store'] = df2.iloc[ilocs.flatten()]['store_no'].values

# Result:
df1
   id   user_lat   user_lon         dist  nearest_store
0   1  13.031885  80.235574  5061.416309             23
1   2  19.099819  72.915288  8248.857621             24
2   3  22.226980  84.836070  7483.628300             23
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
  • Thanks for your reply. But I have 90 M records..if i used for loop may be python will be crash...just alternative suggestion instead of for loop – Pallavi Verma Sep 04 '19 at 03:13
  • Oops, somehow I was in for-loop mode. Edited! – Peter Leimbigler Sep 04 '19 at 03:22
  • Thanks for your quick reply...but I also wants the distance value (in km) – Pallavi Verma Sep 04 '19 at 03:36
  • You're welcome. I'm away from a computer now and can't test, but I *think* I've edited the code to provide the distances themselves. (BallTree.query returns distances as well as indexes of results.) Will check & finish editing in about 8 h – Peter Leimbigler Sep 04 '19 at 03:44
  • Edited; curious to know if it processes 90 million rows on your machine! – Peter Leimbigler Sep 04 '19 at 12:20
  • @PeterLeimbigler Your answer is exceptionally good..But I have same doubt when you run this code on 90 millions data.mainly the cross join.Because I am also facing similar issues.I have almost 80 millions records but I also have city name. So instead of cross join with all combination if we can perform in city wise it’s help us lot and I am facing problems how to iteration only city wise not with every rows. Any help city wise? – Nikita Agarwal Sep 08 '19 at 14:48
  • @NikitaAgarwal, my approach to that problem would probably be the same as here. If you post it as a new question, let me know! – Peter Leimbigler Sep 09 '19 at 02:49
  • @PeterLeimbigler Sure...I will create a separate post and send you the link...many many thanks for you help – Pallavi Verma Sep 09 '19 at 11:45
  • @PeterLeimbigler In the final results df1 shows distance 8248.857621 or 7483.628300. I think there is some problem in this calculation.Can you please check once.Is it possible to show s_lat and s_lon also i.e. final data frame consist of id,user_lat, user_lon,s_lat, s_lon,dist,nearest_store.Please help me. Thanks in advance. – Pallavi Verma Sep 15 '19 at 13:45