I have the following dataframe in pandas:

import pandas as pd

df = pd.DataFrame({
    "CityId": {
        "0": 0, 
        "1": 1, 
        "2": 2, 
        "3": 3, 
        "4": 4
    }, 
    "X": {
        "0": 316.83673906150904, 
        "1": 4377.40597216624, 
        "2": 3454.15819771172, 
        "3": 4688.099297634771, 
        "4": 1010.6969517482901
    }, 
    "elevation_meters": {
        "0": 1, 
        "1": 2, 
        "2": 3, 
        "3": 4, 
        "4": 5
    }, 
    "Y": {
        "0": 2202.34070733524, 
        "1": 336.602082171235, 
        "2": 2820.0530112481106, 
        "3": 2935.89805580997, 
        "4": 3236.75098902635
    }
})

I am trying to create a distance matrix that represents the cost of moving between each of these CityIds. Using pdist and squareform from scipy.spatial.distance I can do the following:

from scipy.spatial.distance import pdist, squareform

df_m = pd.DataFrame(
    squareform(
        pdist(df[['X', 'Y']], metric='euclidean')
    ),
    index=df.CityId.unique(),
    columns=df.CityId.unique()
)

This gives me a distance matrix between all the CityIds using pairwise distances calculated from pdist.

I would like to incorporate elevation_meters into this distance matrix. What is an efficient way to do so?

ZeroStack

1 Answer

You can try scipy.spatial.distance_matrix:

from scipy.spatial import distance_matrix

# Treat each row as a 3-D point (X, elevation, Y)
xx = df[['X', 'elevation_meters', 'Y']]
pd.DataFrame(distance_matrix(xx, xx), columns=df['CityId'],
             index=df['CityId'])

Output:

CityId            0            1            2            3            4
CityId
0          0.000000  4468.691544  3197.555070  4432.386687  1245.577226
1       4468.691544     0.000000  2649.512402  2617.799439  4443.602402
2       3197.555070  2649.512402     0.000000  1239.367465  2478.738402
3       4432.386687  2617.799439  1239.367465     0.000000  3689.688537
4       1245.577226  4443.602402  2478.738402  3689.688537     0.000000

Quang Hoang
  • Thanks, this seems to work. I'm still trying to understand `scipy.spatial.distance_matrix`, how does it differentiate between latitude/longitude and elevation? Generally, aren't `z` coordinates just a height in meters/kilometers? Why is `elevation_meters` positioned in the middle? – ZeroStack May 15 '19 at 00:59
  • In a nutshell, it looks at every pair of rows and computes the Euclidean distance `sqrt((x1-x2)**2 + (z1-z2)**2 + (y1-y2)**2)`. As for why `elevation_meters` comes in the middle, I have no idea. Maybe you should ask the creator of your data. – Quang Hoang May 15 '19 at 01:04
  • In terms of the `elevation_meters`, I was referring to the positional placement in `scipy.spatial.distance_matrix` function, and whether it matters, especially after considering that latitude and longitude are represented as geographic coordinates and elevation_meters is in meters. – ZeroStack May 15 '19 at 01:53
  • The order doesn't matter as you can see in the formula. You can pass either `[X, elev, Y]` or `[X,Y,elev]` and still get the same answer. – Quang Hoang May 15 '19 at 01:56
  • In that case it seems that `squareform(pdist(df[['X','elevation_meters', 'Y']])) == distance_matrix(xx,xx)` – ZeroStack May 15 '19 at 02:01
  • That's what I was trying to say in my comments below your question. – Quang Hoang May 15 '19 at 02:02
  • Thanks Quang, my mistake. I did not realise it accepts points in an n-dimensional space. – ZeroStack May 15 '19 at 02:04
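
The points made in the comments can be checked with a short sketch. It rebuilds the question's data in a simplified form (plain lists and a default integer index, which is an assumption made for brevity), then compares `distance_matrix` against `squareform(pdist(...))`, swaps the column order, and recomputes one entry by hand from the formula. The variable names `xyz`, `xzy`, and `manual_01` are illustrative only. Note that a plain Euclidean distance treats all three columns as the same kind of quantity, so the result is only meaningful if `X`, `Y`, and `elevation_meters` share a unit.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.spatial.distance import pdist, squareform

# Same values as in the question, built from lists for brevity (assumption:
# the default integer index is fine for this check).
df = pd.DataFrame({
    "CityId": [0, 1, 2, 3, 4],
    "X": [316.83673906150904, 4377.40597216624, 3454.15819771172,
          4688.099297634771, 1010.6969517482901],
    "Y": [2202.34070733524, 336.602082171235, 2820.0530112481106,
          2935.89805580997, 3236.75098902635],
    "elevation_meters": [1, 2, 3, 4, 5],
})

# Two column orders for the same 3-D points
xyz = df[["X", "Y", "elevation_meters"]].to_numpy()
xzy = df[["X", "elevation_meters", "Y"]].to_numpy()

d_xyz = distance_matrix(xyz, xyz)                     # full pairwise matrix
d_xzy = distance_matrix(xzy, xzy)                     # same points, columns reordered
d_pdist = squareform(pdist(xyz, metric="euclidean"))  # condensed form expanded to square

# One entry computed directly from the formula: cities 0 and 1
manual_01 = np.sqrt(((xyz[0] - xyz[1]) ** 2).sum())

print(np.allclose(d_xyz, d_xzy))          # column order does not change the result
print(np.allclose(d_xyz, d_pdist))        # both approaches agree
print(np.isclose(d_xyz[0, 1], manual_01))

# Labelled matrix, as in the answer
df_m = pd.DataFrame(d_pdist, index=df["CityId"], columns=df["CityId"])
```

Since `pdist` computes each pair only once and `squareform` mirrors it, that route may be somewhat cheaper than `distance_matrix(xx, xx)` for large inputs, though this has not been measured here.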