I am comparing the Jaccard distance matrix I get when I process a dataset using pdist with the one I get from a DIY Jaccard distance matrix function. I'm getting different results in my output distance matrices and I'm not sure why.
I think one of the following is the cause:
- My implementation of jaccard distance calculation is wrong
- scipy.spatial.distance.pdist (metric = 'jaccard') and scipy.spatial.distance.jaccard calculate Jaccard distance in different ways (seems unlikely as they're both in scipy.spatial.distance)
- squareform is doing something to my data, potentially a normalisation
The docs for squareform go a bit over my head, so some form of normalisation might be what's happening. However, the squareform-ed distance matrix does not have the same relative distance magnitudes between cells, which is confusing (e.g. row 0 in my DIY distance matrix is 0, 0.571429, 1, and with pdist it is 0, 1, 1; the middle value is almost twice as high with pdist).
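One way to test the squareform hypothesis in isolation is to feed it a hand-written condensed vector and look at what comes back. This is only a minimal sketch: the values in `condensed` are the three pairwise distances from my DIY matrix typed in by hand, not something computed by pdist.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hand-typed condensed distances for 3 items, in the order (0,1), (0,2), (1,2).
condensed = np.array([0.571429, 1.0, 1.0])

# squareform lays the condensed vector out as a symmetric matrix
# with zeros on the diagonal.
square = squareform(condensed)
print(square)

# Passing the square matrix back in recovers the condensed vector.
round_trip = squareform(square)
print(round_trip)
```

If the values come back unchanged, squareform is only rearranging the numbers between the condensed and square layouts, which would point back at the first two hypotheses rather than at a normalisation step.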
Can anyone explain why I'm getting a different distance matrix when it's being analysed with the same metric?
My code:
import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # I don't care about every value in the array for my use case,
    # so I don't want to include them in my comparison
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])
# =============================================================================
# DIY distance matrix
# =============================================================================
#set filler val to None, so the arrays being compared are equivalent to pdist
dist_diy = np.array([[jaccard_dissimilarity(a,b, None) for a in data_array] for b in data_array])
# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric = 'jaccard'))
Input array:
1 2 3 4 5
3 4 5 6 7
8 9 10 11 12
dist_diy:
0 0.571429 1
0.571429 0 1
1 1 0
dist_pdist:
0 1 1
1 0 1
1 1 0
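A further check that might help narrow this down: my DIY function turns each pair of rows into 0/1 membership vectors before calling jaccard, whereas pdist above is given the raw integer rows. The sketch below encodes every row as a boolean membership vector over the union of all values and passes that to pdist instead. `to_indicator_matrix` is a hypothetical helper written for this comparison, not part of scipy.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def to_indicator_matrix(rows):
    """Hypothetical helper: one boolean column per distinct value in `rows`."""
    rows = np.asarray(rows)
    vocabulary = np.unique(rows)  # sorted distinct values across all rows
    # Entry [i, j] is True when vocabulary[j] occurs anywhere in row i.
    return np.array([np.isin(vocabulary, row) for row in rows])


data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])

indicators = to_indicator_matrix(data_array)
dist_indicator = squareform(pdist(indicators, metric='jaccard'))
print(np.round(dist_indicator, 6))
```

With this encoding each column means "value present / absent", so two rows are compared as sets rather than position by position. The columns are built from all rows at once instead of per pair, but a column that is False in both rows does not affect the Jaccard distance, so this should be comparable to the DIY calculation.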