Find nearest point in other dataframe (WITH A LOT OF DATA)

Question

The problem is simple i have two DataFrame :

one with 90 000 apartment and their latitude/longitude
and one with 3 000 pharmacy and their latitude/longitude

And i want to create a new variable for all my apartments : 'distance of the nearest pharmacy'

For this i tried two methods which spend to much time:

First method : I created a matrix with my apartments in row and my pharmacies in columns and the distance betwenn them in the intersection, after that i just take the min of the matrix to have a column vector of 90 000 value

I just use a double for with numpy :

m,n=len(result['latitude']),len(pharma['lat'])
M = np.ones((m,n))
for i in range(m):
     for j in range(n):
        if (result['Code departement'][i]==pharma['departement'][j]):
            M[i,j] =(pharma['lat'][j]-result['latitude'][i])**2+(pharma['lng'][j]-result['longitude'] [i])**2

ps : i know that the wrong formula for lat/long but the appartments are in the same region so it's a good aproximation

Second method : I use the solution from this topics (who are the same problem but with less data) https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe

I used geopandas et the nearest method :

from shapely.ops import nearest_points
pts3 = pharma.geometry.unary_union


def near(point, pts=pts3):
     nearest = pharma.geometry == nearest_points(point, pts)[1]
     return pharma[nearest].geometry.get_values()[0]

appart['Nearest'] = appart.apply(lambda row: near(row.geometry), axis=1)

And as i said, both methods spend too much time, after 1 hour of running my pc/notebook crashed and it failed.

My final question : have you a optimized method to go faster ? it is possible ? If it's already optimized, i will buy a other PC but which criteria but what criteria to look for to have a PC capable of doing such a quick calculation ?

I think you should be following the second answer to the question you refer us to, ie the one using a spatial index to avoid an all:all calculation of distances. — High Performance Mark, Nov 16 '19 at 18:41
Have you a example ? Because i have the impression that i already use a spatial index in the second solution with geopandas and this changed nothing for the time spent. — Arnaud Hureaux, Nov 16 '19 at 23:28
Then I have misunderstood your code and my earlier comment was mistaken. — High Performance Mark, Nov 17 '19 at 09:01
Just to clarify, the second option based on shapely does not use spatial index. — martinfleis, Nov 17 '19 at 12:22
No no it's certainly me that don't understood what are a spatial index. Have you one example ? one link ? — Arnaud Hureaux, Nov 17 '19 at 15:08
You can start here https://geoffboeing.com/2016/10/r-tree-spatial-index-python/ but keep in mind that this is for intersection. I have implemented similar stuff here https://docs.momepy.org/en/stable/_modules/momepy/elements.html#get_network_id . Hope it helps. — martinfleis, Nov 17 '19 at 16:33

mgc · Accepted Answer · 2019-11-19T21:11:45.260

11

I guess a Ball Tree is an appropriate structure for this task.

You can use the scikit-learn implementation, see the code below for an example adapted to your case :

import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from sklearn.neighbors import BallTree

## Create the two GeoDataFrame to replicate your dataset
appart = gpd.GeoDataFrame({
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(100000), np.random.rand(100000))
])

pharma = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(3000), np.random.rand(3000))
])

# Create a BallTree 
tree = BallTree(pharma[['x', 'y']].values, leaf_size=2)

# Query the BallTree on each feature from 'appart' to find the distance
# to the nearest 'pharma' and its id
appart['distance_nearest'], appart['id_nearest'] = tree.query(
    appart[['x', 'y']].values, # The input array for the query
    k=1, # The number of nearest neighbors
)

With this method you can solve your problem pretty quickly (the above example, on my computer, took less than a second to find the index of the nearest point, out of 3000 points, on an input dataset of 100000 points).

By default the query method of BallTree is returning the distance to the nearest neighbor and its id. If you want you can disable returning the distance of this nearest neighbor by setting the return_distance parameter to False. If you really only care about the distance, you can save only this value :

appart['distance_nearest'], _ = tree.query(appart[['x', 'y']].values, k=1)

edited Nov 19 '19 at 21:11

answered Nov 19 '19 at 20:40

mgc

5,223
1
24
37

1

Ooooh thanks man, its so efficiency and fast :o It's a very good news :) I was trying to make a segmentation by region to have less calculs and it worked too but less than BallTree **Last question :** how have a distance in kilometer with balltree on a latitude and longitude ? Because here i have a distance but i don't know chat it represent really (ans if it has sense). – Arnaud Hureaux Nov 21 '19 at 21:29
1

I think you should convert your `pharma` and your `appart` geodataframes to use a projected coordinate system (such as 'epsg:2154' for France or 'epsg:3035' for Europe) by doing `appart.to_crs(epsg=2154, inplace=True)` (same for `pharma`). Then you create x and y columns by doing `appart['x'] = appart.geometry.x` and `appart['y'] = appart.geometry.y` (same for `pharma`). Then you can use the ballTree as described in my answer, the returned distance will be in meters. – mgc Nov 21 '19 at 22:25
@ArnaudH don't hesitate to tell if you prefer that i expand my explanation in my answer instead than in comments ! – mgc Nov 21 '19 at 22:27
1

It's perfect in comment but as you want :) I just change the metrics to the balltree : `tree = BallTree(pharma[['lat_r', 'lng_r']].values, leaf_size=2, metric='haversine')` And i convert my degrees into radians with : `appart['latitude_r']=pd.DataFrame(np.deg2rad(appart['latitude'].values)) appart['longitude_r']=pd.DataFrame(np.deg2rad(appart['longitude'].values)) pharma['lat_r']=pd.DataFrame(np.deg2rad(pharma['lat'].values)) pharma['lng_r']=pd.DataFrame(np.deg2rad(pharma['lng'].values)) ` It's still not good but it's a little problem which will be solve in few time i think :) – Arnaud Hureaux Nov 21 '19 at 22:43
1

Oh they are a limitation of characters in comment, and it's true that i don't understant what is a coordinate system, if you have some documentation about this and why it will solve my problem i'am open ;) – Arnaud Hureaux Nov 21 '19 at 22:45
Great, i forgot about the 'haversine' metric ! Note that the output will also be in radians in this case. – mgc Nov 23 '19 at 09:32
Regarding the question about "coordinate systems" maybe you can start with an introduction like this: http://proceedings.esri.com/library/userconf/proc16/tech-workshops/tw_1746-46.pdf . Basically, the idea is to transform coordinates in latitude longitude (corresponding to the position on a global 3D spherical surface) to coordinates on a flat 2d surface. There is not "one way" to do this for the whole world without introducing deformations ; that's why local projections (such as [epsg:2154 for France](https://epsg.io/2154)) exists. – mgc Nov 23 '19 at 09:34

Find nearest point in other dataframe (WITH A LOT OF DATA)

1 Answers1

Linked