2

I want to generate a distance matrix 500X500 based on latitude and longitude of 500 locations, using Haversine formula.

Here is the sample data "coordinate.csv" for 10 locations:

Name,Latitude,Longitude
depot1,35.492807,139.6681689
depot2,33.6625572,130.4096027
depot3,35.6159881,139.7805445
customer1,35.622632,139.732631
customer2,35.857287,139.821461
customer3,35.955313,139.615387
customer4,35.16073,136.926239
customer5,36.118163,139.509548
customer6,35.937351,139.909783
customer7,35.949508,139.676462

After getting the distance matrix, I want to find the closest depot to each customer based on the distance matrix, and then save the output (Distance from each customer to the closet depot & Name of the closest depot) to Pandas DataFrame.

Expected outputs:

// Distance matrix
[ [..],[..],[..],[..],[..],[..],[..],[..],[..],[..] ]

// Closet depot to each customer (just an example)
Name,Latitude,Longitude,Distance_to_closest_depot,Closest_depot
depot1,35.492807,139.6681689,,
depot2,33.6625572,130.4096027,,
depot3,35.6159881,139.7805445,,
customer1,35.622632,139.732631,10,depot1
customer2,35.857287,139.821461,20,depot3
customer3,35.955313,139.615387,15,depot2
customer4,35.16073,136.926239,12,depot3
customer5,36.118163,139.509548,25,depot1
customer6,35.937351,139.909783,22,depot2
customer7,35.949508,139.676462,15,depot1
Prune
  • 76,765
  • 14
  • 60
  • 81
belle
  • 109
  • 2
  • 11
  • So you have explained what you want to do. Please note that Stack Overflow is not a blog for what people plan to do. If however you have a question, then please consult [How to ask a good question](https://stackoverflow.com/help/how-to-ask), because currently there is no question. Don't forget to post the code you have, specifying where exactly you are stuck. – trincot Oct 09 '19 at 15:23

2 Answers2

2

There are a couple of library functions that can help you with this:

  • cdist from scipy can be used to generate a distance matrix using whichever distance metric you like.
  • There is also a haversine function which you can pass to cdist.

After that it's just a case of finding the row-wise minimums from the distance matrix and adding them to your DataFrame. Full code below:

import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine


df = pd.read_clipboard(sep=',')
df.set_index('Name', inplace=True)
customers = df[df.index.str.startswith('customer')]
depots = df[df.index.str.startswith('depot')]

dm = cdist(customers, depots, metric=haversine)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)

customers['Closest Depot'] = depots.index[closest]
customers['Distance'] = distances

Results:

            Latitude   Longitude Closest Depot    Distance
Name                                                      
customer1  35.622632  139.732631        depot3    4.393506
customer2  35.857287  139.821461        depot3   27.084212
customer3  35.955313  139.615387        depot3   40.565820
customer4  35.160730  136.926239        depot1  251.466152
customer5  36.118163  139.509548        depot3   60.945377
customer6  35.937351  139.909783        depot3   37.587862
customer7  35.949508  139.676462        depot3   38.255776

As per comment, I have created an alternative solution which instead uses a square distance matrix. The original solution is better in my opinion, as the question stated that we only want to find the closest depot for each customer, so calculating distances between customers and between depots isn't necessary. However, if you need the square distance matrix for some other purpose, here is how you would create it:

import pandas as pd
import numpy as np
from scipy.spatial.distance import squareform, pdist
from haversine import haversine


df = pd.read_clipboard(sep=',')
df.set_index('Name', inplace=True)

dm = pd.DataFrame(squareform(pdist(df, metric=haversine)), index=df.index, columns=df.index)
np.fill_diagonal(dm.values, np.inf)  # Makes it easier to find minimums

customers = df[df.index.str.startswith('customer')]
depots = df[df.index.str.startswith('depot')]
customers['Closest Depot'] = dm.loc[depots.index, customers.index].idxmin()
customers['Distance'] = dm.loc[depots.index, customers.index].min()

The final results are the same as before, except you now have a square distance matrix. You can put the 0s back on the diagonal after you have extracted the minimum values if you like:

np.fill_diagonal(dm.values, 0)
sjw
  • 6,213
  • 2
  • 24
  • 39
  • Thanks for your answer. Can I use **pd.read_csv("coordinate.csv")** instead of pd.read_clipboard? – belle Oct 10 '19 at 01:58
  • Yes, `pd.read_clipboard` was just my way of reading your data into the DataFrame, but `pd.read_csv("coordinate.csv")` should work for you. – sjw Oct 10 '19 at 08:00
  • I want dm to be a square distance matrix format `dm = [ [ ],[ ],...]` with 0-value diagonal, because distance from A to A is 0. How can I do that and implement it with your other codes to make the same output? – belle Oct 10 '19 at 11:30
  • @belle - I have edited my answer to give a second approach. – sjw Oct 10 '19 at 13:12
  • Sorry, because my distance matrix is an array of arrays (600x600) from google maps distance matrix api, so I want to know how to get the output by reading data from that array. Could you please help with this once again? Really appreciate your help. – belle Oct 10 '19 at 14:28
  • Here is the link where I got the distance matrix: https://developers.google.com/optimization/routing/vrp . With this, how can I get the output that I want? Or if you know other algorithm besides Haversine to get the real-world distance between 2 locations, that would be great. – belle Oct 10 '19 at 14:45
0

If you need a very big matrix and have access to a NVIDIA GPU with CUDA you can use this numba function:

from numba import cuda
import math

@cuda.jit
def haversine_gpu_distance_matrix(p, G):
  i,j = cuda.grid(2)
  if i < p.shape[0] == G.shape[0] and j < p.shape[0] == G.shape[1]:
    if i == j:
      G[i][j] = 0
    else:
      longit_a = math.radians(p[i][0])
      latit_a = math.radians(p[i][1])
      longit_b = math.radians(p[j][0])
      latit_b =  math.radians(p[j][1])
      dist_longit_add = longit_b - longit_a
      dist_latit_sub = latit_b - latit_a
      dist_latit_add = latit_b + latit_a
      pre_comp = math.sin(dist_latit_sub/2)**2
      area = pre_comp + ((1 - pre_comp - math.sin(dist_latit_add/2)**2) * math.sin(dist_longit_add/2)**2)
      central_angle = 2 * math.asin(math.sqrt(area))
      radius = 3958
      G[i][j] = math.fabs(central_angle * radius)

You can call this function using the following commands:

# 10k [lon, lat] elements, replace this with your [lon, lat] array
# if your data is in a Pandas DataFrame, please convert it to a numpy array
geo_array = np.ones((10000, 2)) 
# allocate an empty distance matrix to fill when the function is called
dm_global_mem = cuda.device_array((geo_array.shape[0], geo_array.shape[0]))
# move the data in geo_array onto the GPU
geo_array_global_mem = cuda.to_device(geo_array)

# specify kernel dimensions, this can/should be further optimized for your hardware
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(geo_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(geo_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run the function, which will transform dm_global_mem inplace
haversine_gpu_distance_matrix[blockspergrid, threadsperblock](geo_array_global_mem, dm_global_mem)

Note that this can be further optimized for your hardware. The runtime on a g4dn.xlarge instance on 10k geographical coordinate pairs (i.e. 100M distance measurements) is less than 0.01 seconds after compiling. The radius value is set such that the distance matrix is in unit miles, you can change it to 6371 for meters.

Ryan Walden
  • 167
  • 10