
I am performing a hierarchical clustering analysis in Python. My variables are binary, so I was wondering how to calculate the binary Euclidean distance. According to the literature, this distance metric can be used with this clustering technique:

Choi, S. S., Cha, S. H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48.

I was using scipy.spatial.distance.pdist(X, metric='euclidean'), but this function computes the Euclidean distance for non-binary data.

Is there any Python library that calculates distance matrices based on the binary Euclidean distance metric?

taras
Jorge Rodriguez

2 Answers

1

The formula in the paper you referenced is simply a faster way to compute the standard Euclidean distance for binary data, so the scipy method will work fine. Is there a different distance you would like to use, or is your data formatted in a way that pdist() can't handle natively?
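To make the equivalence concrete, here is a minimal sketch. It assumes the binary Euclidean distance in the paper is the square root of the number of positions where two vectors differ (my reading of the formula), and it uses a made-up toy matrix `X`. If that reading is right, it should agree with `pdist(X, metric='euclidean')` on 0/1 data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 4 observations, 5 binary variables.
X = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0],
])

# Standard Euclidean distance, exactly as used in the question.
d_euclid = pdist(X, metric="euclidean")

# pdist's 'hamming' metric returns the *fraction* of differing positions,
# so multiplying by the number of variables recovers the mismatch count.
n_mismatch = pdist(X, metric="hamming") * X.shape[1]

# Assumed "binary Euclidean" form: sqrt(number of mismatching positions).
d_binary = np.sqrt(n_mismatch)

print(np.allclose(d_euclid, d_binary))  # the two computations agree
print(squareform(d_euclid).round(3))    # full symmetric distance matrix
```

The agreement follows from the fact that, for values restricted to 0 and 1, each squared difference is either 0 or 1, so the sum of squared differences is just a count of mismatches.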

Hans Musgrave
  • I wanted to confirm whether this function is valid to use with binary data or not. Indeed, to me it is not so clear that the formula referenced in the paper is a faster way to compute the standard formula. – Jorge Rodriguez Aug 16 '18 at 08:07
  • The validity depends on what kind of data it is (in terms of domain knowledge, not just whether it's binary or not) and what you're doing with it. The Euclidean distance induces the same topology as most other useful metrics, so in some sense the worst thing that can happen is that you get the right answer plus a distortion. That's fine in some domains and not in others. As to the speed, all the paper is doing in that section is noting that for binary vectors v and w, the element-wise |v-w| is the same as (v XOR w). If your data is stored bitwise, this can be really fast. – Hans Musgrave Aug 16 '18 at 13:45
  • Note that the speed comment doesn't apply to, e.g., a list of floats which happen to only be 0 or 1. In Python, that carries the extra overhead of everything being an object. In most languages (Python included), it at least carries the extra bits needed to represent the floats. To help you better, we really need an example of what you mean by "binary data" to be able to suggest which methods to use. – Hans Musgrave Aug 16 '18 at 13:47
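The XOR observation in the comments above can be sketched as follows. This is only an illustration, assuming the data can be bit-packed with NumPy; the vectors `v` and `w` are invented for the example, and counting bits via `np.unpackbits` is a simple stand-in for a real popcount, not a tuned implementation.

```python
import numpy as np

# Hypothetical binary vectors stored as one byte per element.
v = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], dtype=np.uint8)
w = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1], dtype=np.uint8)

# Reference: ordinary Euclidean distance (cast to float first, because
# subtracting uint8 arrays would wrap around instead of going negative).
d_ref = np.linalg.norm(v.astype(float) - w.astype(float))

# Pack 8 elements per byte. Both vectors are zero-padded identically,
# so the padding bits XOR to 0 and do not affect the count.
v_bits = np.packbits(v)
w_bits = np.packbits(w)

# XOR marks the differing positions; counting the set bits gives the
# number of mismatches.
diff = np.bitwise_xor(v_bits, w_bits)
n_mismatch = int(np.unpackbits(diff).sum())

d_xor = n_mismatch ** 0.5
print(n_mismatch, d_xor, d_ref)
```

Whether this is actually faster depends on how the data is stored to begin with; packing on the fly just to XOR is unlikely to pay off for small inputs.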
0

Solution 1 - numpy

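from numpy import linalg, array

M1 = [[1, 1], [0, 1]]
M2 = [[0, 1], [1, 1]]

# With no extra arguments, linalg.norm returns the square root of the sum of
# squared element-wise differences, i.e. the Euclidean distance between M1 and M2.
print(linalg.norm(array(M1) - array(M2)))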

Solution 2 - custom

M1 = [[1, 1], [0, 1]]
M2 = [[0, 1], [1, 1]]

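def binary_dist(m1, m2):
    # Count the positions where the two binary matrices differ.
    # (Named 'mismatches' to avoid shadowing the built-in sum().)
    mismatches = 0
    for i in range(len(m1)):
        for j in range(len(m1[i])):
            if m1[i][j] != m2[i][j]:
                mismatches += 1
    # For 0/1 data, each mismatch adds exactly 1 to the squared distance.
    return mismatches ** .5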


print(binary_dist(M1, M2))
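Both solutions above compare a single pair of arrays, whereas the question asks for a distance matrix to feed into hierarchical clustering. A possible sketch of that step, assuming each row of a made-up matrix `X` is one binary observation, and picking average linkage and two clusters arbitrarily for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: rows are observations, columns are binary variables.
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
])

# Condensed pairwise Euclidean distances between rows.
d = pdist(X, metric="euclidean")

# Agglomerative clustering on the precomputed distances.
# 'average' linkage is an arbitrary choice for this sketch.
Z = linkage(d, method="average")

# Cut the tree into at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`linkage` accepts the condensed vector returned by `pdist` directly, so there is no need to build the square matrix first.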
Omar Cusma Fait