import scipy.spatial.distance as dist

Y = [[1, 2, 3], [2, 3, 4]]

# Pairwise Jaccard distance between the rows of Y
Q = dist.pdist(Y, 'jaccard')

print(Q)

The snippet above gives a Jaccard distance of 1, while it should be 0.5. On the other hand, if Y=[[1,2,3],[4,2,3]], i.e. if the ordering is changed, the output is 0.33. But the Jaccard distance is independent of the order of elements. Can you suggest how to resolve this issue?

Harsh
  • The [docs](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.pdist.html) aren't very clear, but they suggest that the ordering is important: they say that the Jaccard distance is "the proportion of those elements u[i] and v[i] that disagree", which I understand is for fixed i for both elements. That would agree with your results. Anyhow, did you check the implementation in their source code? – phfaist Feb 20 '16 at 14:16
  • The docstring for the `jaccard` function (http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html) gives a better description. `jaccard` computes the Jaccard-Needham dissimilarity for *boolean* arrays. Its behavior for other array types is not defined, so you shouldn't be passing in arrays of arbitrary integers. – Warren Weckesser Feb 20 '16 at 14:41

2 Answers


The docstring for the jaccard function gives a better description of the calculation than the terse summary in the pdist docstring. jaccard computes the Jaccard-Needham dissimilarity for boolean arrays. Its behavior for other array types is not defined, so you shouldn't be passing in arrays of arbitrary integers.
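One way to get the set-based result the question expects is to turn each list into a boolean membership vector first, which is the kind of input the answer above says `jaccard` is defined for. The following is only a minimal sketch: `to_indicator` is a hypothetical helper written for this example, not part of scipy.

```python
import numpy as np
from scipy.spatial.distance import pdist


def to_indicator(rows):
    """Encode each row as a boolean membership vector (hypothetical helper).

    Column j is True for a row when the j-th distinct value (in sorted
    order over all rows) occurs anywhere in that row.
    """
    vocab = sorted(set().union(*rows))
    index = {value: j for j, value in enumerate(vocab)}
    out = np.zeros((len(rows), len(vocab)), dtype=bool)
    for i, row in enumerate(rows):
        for value in row:
            out[i, index[value]] = True
    return out


# Both orderings from the question describe the sets {1, 2, 3} and {2, 3, 4},
# so both should give the same distance once order is encoded away.
for Y in ([[1, 2, 3], [2, 3, 4]], [[1, 2, 3], [4, 2, 3]]):
    print(pdist(to_indicator(Y), 'jaccard'))
```

With boolean input the position of an element in the original list no longer matters, so both versions of Y from the question should come out as 0.5.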

Warren Weckesser

For anyone else with this issue: pdist appears to compare arrays position by position (by index) rather than by which elements are present, so the scipy implementation is order dependent. However, the input arrays are not treated as boolean arrays, in the sense that [1,2,3] and [4,5,6] are not both treated as [True True True], unlike in the scipy jaccard function.

I had a similar issue and looked into it here:
Why are there discrepancies when generating a distance matrix with scipy pdist(metric = 'jaccard') vs scipy jaccard?
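To make the difference concrete, here is a small pure-Python sketch that contrasts a position-by-position comparison with a set-based Jaccard distance. Both functions are illustrative helpers written for this example. They are not scipy's implementation, and the position-wise one only mirrors the "compare by index" reading described above.

```python
def positionwise_mismatch(u, v):
    """Fraction of positions i where u[i] != v[i] (illustration only)."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def set_jaccard_distance(u, v):
    """1 - |A & B| / |A | B|, treating the inputs as sets (order ignored)."""
    a, b = set(u), set(v)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(union)


for u, v in (([1, 2, 3], [2, 3, 4]), ([1, 2, 3], [4, 2, 3])):
    print(u, v, positionwise_mismatch(u, v), set_jaccard_distance(u, v))
```

The position-wise value changes with ordering (1.0 for the first pair, about 0.33 for the second, matching the numbers in the question), while the set-based value is 0.5 in both cases.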

Gino Mempin
Tim Kirkwood