I am trying to measure the distance between points inside a pandas dataframe. I am first looking to measure the distance between points that are in a subregion and get the average distance for that group. Then I want to measure the distance between the subregions (measuring the distance between those two vectors). I understand how to do the measuring part (using scipy.spatial.distance.euclidean for the former and scipy.spatial.distance.cdist for the latter). The issue I am running into is figuring out how to apply the functions to the dataset. I think I should use groupby.apply() and feed in my function, but I'm having trouble conceptualizing that. The dataframe looks like this:
id, latitude, longitude, subregion, region
Currently I have:
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
df = pd.read_csv('targets.csv')
...
def calculate_distance(x,y):
    return x._get_numeric_data().apply(axis=0, func=euclidean[x,y]).mean()
df.groupby('subregion').apply(calculate_distance)
I know this is incorrect, as I want to apply the function to multiple columns for all the rows. My other thought is that I am using the wrong data structure for this.
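To make the goal concrete, here is a minimal sketch of the groupby.apply() route I am picturing. It makes several assumptions that are not part of my real setup: the small inline dataframe is made-up data standing in for targets.csv, the helper name mean_pairwise_distance is hypothetical, scipy.spatial.distance.pdist is swapped in for repeated euclidean calls to get all pairwise distances within a group, and "distance between subregions" is taken to mean the distance between subregion centroids. Latitude and longitude are treated as plain planar coordinates.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, pdist

# Made-up sample rows standing in for targets.csv (same columns as above).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "latitude": [0.0, 3.0, 0.0, 10.0, 10.0],
    "longitude": [0.0, 4.0, 0.0, 0.0, 2.0],
    "subregion": ["a", "a", "a", "b", "b"],
    "region": ["r1", "r1", "r1", "r1", "r1"],
})

coord_cols = ["latitude", "longitude"]


def mean_pairwise_distance(group):
    """Average Euclidean distance over all point pairs in one group."""
    coords = group[coord_cols].to_numpy()
    if len(coords) < 2:
        return np.nan  # a single point has no pairs to measure
    return pdist(coords, metric="euclidean").mean()


# Select the coordinate columns first, so each group passed to the
# function holds only latitude and longitude for that subregion.
within = df.groupby("subregion")[coord_cols].apply(mean_pairwise_distance)

# Assumption: compare subregions through their centroids (mean lat/long).
centroids = df.groupby("subregion")[coord_cols].mean()
between = pd.DataFrame(
    cdist(centroids.to_numpy(), centroids.to_numpy(), metric="euclidean"),
    index=centroids.index,
    columns=centroids.index,
)
```

Here within would be a Series indexed by subregion and between a square dataframe of subregion-to-subregion distances. The function passed to apply takes the whole group as a single argument, which is where my calculate_distance(x, y) signature goes wrong.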