I have a dataframe, and I would like to apply my own distance function pairwise to its rows. The problem is that myDistance takes two dataframe rows (pandas Series, indexed by column name), while sklearn's pairwise_distances and scipy's pdist convert the input to an ndarray. Example:
import pandas as pd
df = pd.DataFrame([[1,2,3,3],[2,3,3,4],[4,1,3,2]],columns=['A','B','C','D'])
This returns:
   A  B  C  D
0  1  2  3  3
1  2  3  3  4
2  4  1  3  2
Then:
def myDistance(f1,f2):
    return f1['A']-f2['A']
myDistance(df.loc[0],df.loc[1])
This works and returns -1.
But this doesn't, because pdist treats each df row as an ndarray, so indexing it by column name fails:
from scipy.spatial.distance import pdist
dist = pdist(df,myDistance)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
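One possible workaround, sketched here rather than offered as the definitive fix: since pdist hands the metric plain ndarray rows, a thin wrapper can rebuild a labelled Series from each row before calling myDistance. The helper name `as_series_metric` is hypothetical (it is not part of scipy or pandas), and the sketch assumes all columns are numeric, because pdist works on a numeric array.

```python
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame([[1, 2, 3, 3], [2, 3, 3, 4], [4, 1, 3, 2]],
                  columns=['A', 'B', 'C', 'D'])

def myDistance(f1, f2):
    # Original metric: expects rows that can be indexed by column name.
    return f1['A'] - f2['A']

def as_series_metric(func, columns):
    # Hypothetical helper: adapts a Series-based metric to the
    # (ndarray, ndarray) call signature that pdist uses.
    def wrapped(u, v):
        return func(pd.Series(u, index=columns),
                    pd.Series(v, index=columns))
    return wrapped

# pdist calls the metric on row pairs (0, 1), (0, 2), (1, 2), in that order.
dist = pdist(df.to_numpy(), as_series_metric(myDistance, df.columns))
print(dist)
```

Building two Series per pair adds overhead, so this is likely to be slow on large frames. Note also that myDistance as written can return negative values, so the result is a signed difference rather than a true distance.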