I have a dataframe, and I would like to apply my own distance function pairwise to its rows. The problem is that myDistance takes two dataframe rows (pandas Series), while sklearn's pairwise_distances and scipy's pdist convert them to ndarrays. Example:

import pandas as pd
df = pd.DataFrame([[1,2,3,3],[2,3,3,4],[4,1,3,2]],columns=['A','B','C','D'])

This returns:

    A   B   C   D
0   1   2   3   3
1   2   3   3   4
2   4   1   3   2

Then:

def myDistance(f1,f2):
    return f1['A']-f2['A']

myDistance(df.loc[0],df.loc[1])

This works and returns -1.
But this doesn't, because pdist treats each df row as an ndarray, so the column label 'A' is no longer a valid index:

from scipy.spatial.distance import pdist
dist = pdist(df,myDistance)

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

matlabit

1 Answer

I think I understand your problem. You basically want to calculate the pairwise distances using only column A of your dataframe. In that case, assuming column A is the first column of the dataframe (so it sits at position 0 in each row array that pdist passes in), change your custom function to:

def myDistance(u, v):
    # u and v arrive as 1-D ndarrays, so index by position, not by label
    return (u - v)[0]  # position 0 corresponds to column A

Now run:

dist = pdist(df, myDistance)

Result:

array([-1., -3., -2.])
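
If you would rather keep the original label-based myDistance unchanged, one possible workaround is to rebuild a labelled Series inside the callable that pdist invokes. The sketch below is only an illustration under that assumption: with_labels is a hypothetical helper name, not part of scipy or pandas, and constructing two Series per pair will likely be slow on large frames.

```python
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame([[1, 2, 3, 3], [2, 3, 3, 4], [4, 1, 3, 2]],
                  columns=['A', 'B', 'C', 'D'])

def myDistance(f1, f2):
    # Label-based function from the question, left unchanged
    return f1['A'] - f2['A']

def with_labels(func, columns):
    # Hypothetical helper: wraps func so that the plain ndarrays
    # handed over by pdist are turned back into labelled Series
    def metric(u, v):
        return func(pd.Series(u, index=columns), pd.Series(v, index=columns))
    return metric

# pdist calls the wrapped function once per row pair: (0,1), (0,2), (1,2)
dist = pdist(df.to_numpy(dtype=float), with_labels(myDistance, df.columns))
print(dist)
```

For this small frame it should give the same three values as above. Note that a signed difference is not a true distance metric; pdist does not check this, it simply stores whatever the callable returns.
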
Scratch'N'Purr