After reading in a .csv file, I have a pandas dataframe that resembles the following:
import itertools as it
import pandas as pd
import numpy as np
from scipy.spatial import distance
x = np.random.randn(5)
y = np.sin(x)
z = np.sin(x)+1
df = pd.DataFrame({'x':x, 'y':y, 'z':z})
df =
x y z
0 0.233070 0.230965 1.230965
1 -1.956269 -0.926621 0.073379
2 -0.015575 -0.015575 0.984425
3 -0.106887 -0.106684 0.893316
4 -0.510168 -0.488324 0.511676
I would like to compute pairwise Euclidean distances between the columns using itertools.combinations and scipy.spatial.distance.euclidean, and store these values either by extending df or as a new dataframe. For example, extending df would resemble this (x.xxxxxx stands for the values that need to be calculated):
df =
x y z x-y x-z y-z
0 0.233070 0.230965 1.230965 x.xxxxxx x.xxxxxx x.xxxxxx
1 -1.956269 -0.926621 0.073379 x.xxxxxx x.xxxxxx x.xxxxxx
2 -0.015575 -0.015575 0.984425 x.xxxxxx x.xxxxxx x.xxxxxx
3 -0.106887 -0.106684 0.893316 x.xxxxxx x.xxxxxx x.xxxxxx
4 -0.510168 -0.488324 0.511676 x.xxxxxx x.xxxxxx x.xxxxxx
The actual dataset I'm working with is large, so I'd like to figure out an efficient, Pythonic way of dealing with this. I only need unique pairwise comparisons, so I'd like to avoid the n-way comparisons that itertools.combinations includes (i.e., here this would be x-y-z), as well as avoid repetitions (e.g., y-x, z-x, z-y). I hope this is clear; thanks for any assistance.
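To make the target concrete, here is a rough sketch of the kind of result I'm after. It rests on one assumption: the per-row distance between two scalar columns is taken to be |a - b|, which is what the Euclidean distance reduces to in one dimension. The names pairs, dist and extended are only illustrative, and the seeded generator is a stand-in for the real .csv data.

```python
import itertools as it

import numpy as np
import pandas as pd

# Small stand-in for the CSV data, built the same way as in the example above.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
df = pd.DataFrame({'x': x, 'y': np.sin(x), 'z': np.sin(x) + 1})

# combinations(..., 2) yields each unordered pair of columns exactly once:
# (x, y), (x, z), (y, z). No reversed pairs and no 3-way groups.
pairs = list(it.combinations(df.columns, 2))

# Assumption: the per-row distance between two scalar columns is |a - b|.
# The subtraction is vectorised, so there is no Python loop over rows.
dist = pd.DataFrame(
    {f'{a}-{b}': (df[a] - df[b]).abs() for a, b in pairs},
    index=df.index,
)

# Either keep `dist` as a separate dataframe or extend the original one.
extended = pd.concat([df, dist], axis=1)
print(extended)
```

If a single distance between whole column vectors is wanted instead of one value per row, scipy.spatial.distance.pdist applied to the transposed values (df.T.values) should return the pairwise Euclidean distances in the same pair order as itertools.combinations.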