I have two lists, l1 and l2, containing integers. The lists may be of different lengths, and I want to perform a computation on every possible pairing of elements between them.
Specifically, I'm checking the Hamming distance between each pair and if the distance is sufficiently small I want to "count" it.
Naively, this could be implemented as:

    def hamming_distance(n1: int, n2: int) -> float:
        # Fraction of differing bits, assuming 32-bit integers
        return bin(n1 ^ n2).count('1') / 32.0

    matches = 0
    for n1 in l1:
        for n2 in l2:
            sim = 1 - hamming_distance(n1, n2)
            if sim >= threshold:
                matches += 1

But this is not very fast.
I've tried, unsuccessfully, to use scipy.spatial.distance.cdist. My plan was to first compute the Hamming distance between all pairs, since the cdist documentation states that it will "Compute distance between each pair of the two collections of inputs", and then count the number of elements satisfying the predicate 1 - d >= threshold, where d is the Hamming distance, i.e.:

    import numpy as np
    from scipy.spatial.distance import cdist

    # Convert the lists to arrays, then reshape for cdist
    l1 = np.array(l1).reshape(-1, 2)
    l2 = np.array(l2).reshape(-1, 2)
    r = cdist(l1, l2, 'hamming')
    matches = np.count_nonzero(1 - r >= threshold)

However, the two solutions find different numbers of matches. I've noticed that it is possible to call cdist with a function, cdist(XA, XB, f), but I have not succeeded in writing my implementation of hamming_distance so that it broadcasts properly.
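
For reference, here is a minimal sketch of what a broadcasting version of the bitwise comparison might look like. It uses plain NumPy rather than cdist, because cdist's 'hamming' metric measures the proportion of differing vector components, not differing bits, which would explain the mismatch in counts. The name count_matches and the sample values are hypothetical, and the sketch assumes every integer is non-negative and fits in 32 bits.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
_BYTE_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def count_matches(l1, l2, threshold):
    """Count pairs (n1, n2) with 1 - popcount(n1 ^ n2) / 32 >= threshold.

    Assumes every integer is non-negative and fits in 32 bits.
    """
    a = np.asarray(l1, dtype=np.uint32)
    b = np.asarray(l2, dtype=np.uint32)
    # Broadcasting a column against a row yields every pairing at once,
    # so the result has shape (len(l1), len(l2)) even for unequal lengths.
    xor = a[:, None] ^ b[None, :]
    # Reinterpret each 32-bit word as 4 bytes and sum the per-byte popcounts.
    as_bytes = xor.view(np.uint8).reshape(xor.shape + (4,))
    differing_bits = _BYTE_POPCOUNT[as_bytes].sum(axis=-1)
    similarity = 1 - differing_bits / 32.0
    return int(np.count_nonzero(similarity >= threshold))


# Made-up inputs of different lengths, purely for illustration.
example_l1 = [0, 1, 3, 0xFFFFFFFF]
example_l2 = [0, 2, 7]
print(count_matches(example_l1, example_l2, threshold=0.9))
```

The intermediate arrays grow with len(l1) * len(l2), so for very long lists it may be necessary to process l1 in chunks.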
I've looked at this question/answer, but it presumes that both lists are of the same length, which is not the case here.