I am trying to calculate the distances between about 1.4 million points, stored in a DataFrame with LATITUDE and LONGITUDE in separate columns, and a small set of hotspots (about 10 points, also given as latitude and longitude), but it is very slow.

I tried the code below:

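from geopy.distance import distance  # import the callable, not the geopy.distance module
import pandas as pd

def distance_rows(start, points):
    # Smallest distance in km from start to any of the points
    return min(distance(start, stop).km for stop in points)

# Row-wise apply: one Python-level call per row, about 10 distance computations each
df['distance'] = df.apply(lambda row: distance_rows((row.LATITUDE, row.LONGITUDE), points), axis=1)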

df is my DataFrame. distance_rows receives a start point and a list of about 10 points, and returns the minimum distance from the start point to any point in the list.

It takes a long time to complete. Do you know a fast way to calculate the distances between millions of points and a few points of interest using Python?

I apologize for my English...
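
The minimum-distance computation described above can be sketched in vectorized form with NumPy broadcasting instead of a row-wise apply. This is only a sketch under assumptions not stated in the question: it uses the haversine (spherical Earth) formula rather than geopy's ellipsoidal distance, so results can differ from geopy by a fraction of a percent, and the function name min_haversine_km and the constant EARTH_RADIUS_KM are hypothetical names chosen for illustration.

```python
import numpy as np

# Assumption: mean Earth radius in km (spherical approximation, not geopy's ellipsoid)
EARTH_RADIUS_KM = 6371.0088

def min_haversine_km(lat, lon, hotspots):
    """For each input point, return (distance in km to the nearest hotspot,
    index of that hotspot).

    lat, lon : 1-D sequences of degrees, length n
    hotspots : sequence of (latitude, longitude) pairs in degrees, length m
    """
    # Column vectors of shape (n, 1) so they broadcast against (1, m) rows
    lat = np.radians(np.asarray(lat, dtype=float))[:, None]
    lon = np.radians(np.asarray(lon, dtype=float))[:, None]
    hs = np.radians(np.asarray(hotspots, dtype=float))  # shape (m, 2)
    hlat = hs[:, 0][None, :]
    hlon = hs[:, 1][None, :]

    # Haversine formula evaluated for all n x m pairs at once
    dlat = hlat - lat
    dlon = hlon - lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(hlat) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot outside [0, 1]
    dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    nearest = dist.argmin(axis=1)  # index of the closest hotspot per point
    return dist[np.arange(dist.shape[0]), nearest], nearest
```

With the df and points from the question, this would be called as df['distance'], df['nearest'] = min_haversine_km(df['LATITUDE'].to_numpy(), df['LONGITUDE'].to_numpy(), points). For 1.4 million points and 10 hotspots, each intermediate array holds 14 million floats (roughly 112 MB), so processing the DataFrame in chunks may be preferable if memory is tight.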

  • Is the goal to compute 14 million distances? Or is it to find which of the 10 points is closest to each of the 1.4 M points? The latter can be done faster. – Rick James Feb 21 '20 at 23:16
  • The goal is to find which of the 10 points is closest to each of the 1.4M points, so the result must be an array of size 1.4M containing the minimum distance for each point. – Jaison González Feb 22 '20 at 16:38
  • This is a 1-time task? It is OK if it takes _hours_ to finish? Then write some brute-force code. If you need a distance function, there are a lot of questions that provide the "haversine" function. – Rick James Feb 22 '20 at 18:49
  • Thanks! And sorry for the delay! I am hoping for it to be near real time... running every 15 minutes :D – Jaison González Feb 24 '20 at 21:52
  • I have routines that come close to realtime -- without pre-computing (as your Question presumed). http://mysql.rjweb.org/doc.php/find_nearest_in_mysql -- Some techniques are usually a few _milliseconds_, rarely over 20ms. Good enough? – Rick James Feb 24 '20 at 23:12
  • Thanks! That's the solution! – Jaison González Feb 25 '20 at 00:25
  • I did not rush to mention that because all the solutions there are based on techniques in MySQL. Are you using a database? – Rick James Feb 25 '20 at 00:36
  • Yes, using SQL is an acceptable solution for me; in my case, Teradata. I used one of your methods and it worked! The whole flow takes 2 minutes in total. I would prefer Python, but the SQL is a smart solution. Thanks for your help! – Jaison González Feb 25 '20 at 18:34
  • I'm curious -- which method did you pick? And, was it hard to convert to Teradata? – Rick James Feb 25 '20 at 20:43
