Python: scipy/numpy all pairs computation between two 1-D vectors

Question

I have two lists l1 and l2 containing integers that may be of different lengths, and I want to perform a computation between every possible pairing between these two vectors.

Specifically, I'm checking the Hamming distance between each pair and if the distance is sufficiently small I want to "count" it.

Naively, this could be implemented

def hamming_distance(n1: int, n2: int) -> float:
    return bin(n1 ^ n2).count('1')/32.0

matches = 0

for n1 in l1:
    for n2 in l2:
        sim = 1 - hamming_distance(n1, n2)

        if sim >= threshold:
            matches += 1

But this is not very fast.

I've unsuccessfully tried to leverage scipy.spatial.distance.cdist, where I figured that I would first compute the Hamming distance between all pairs, as the scipy.spatial.cdist documentation states that it will

Compute distance between each pair of the two collections of inputs.

and then count the number of elements satisfying the predicate that 1 - d >= threshold where d is the Hamming distance, i.e.

from scipy.spatial.distance import cdist

l1 = l1.reshape(-1, 2)  # After np.array
l2 = l2.reshape(-1, 2)
r = cdist(l1, l2, 'hamming')
matches = np.count_nonzero(1 - r >= threshold)

but the number of matches found by the respective solutions are different. I've noticed that it is possible to call cdist with a function, cdist(XA, XB, f) but I have not succeeded in writing my implementation of hamming_distance so that it broadcasts properly.

I've looked at this question/answer but it presumes that both lists are of the same length which is not the case here.

score 1 · Answer 1 · answered Sep 25 '19 at 08:09

1

You can use np.bitwise_xor.outer together with np.binary_repr and np.char.count:

import numpy as np

a = np.random.randint(0, 10, size=5)
b = np.random.randint(0, 10, size=5)

binary_repr = np.vectorize(np.binary_repr)
distance = np.char.count(binary_repr(np.bitwise_xor.outer(a, b)), '1') / 32

Then to get the matches:

matches = np.sum(distance >= threshold)

answered Sep 25 '19 at 08:09

a_guest

34,165
12
64
118

While this has feature parity, it wasn't significantly faster. The accepted answer resulted in a speed-up of about 100x in my use-case. – Filip Allberg Sep 25 '19 at 12:55

score 1 · Accepted Answer · answered Sep 25 '19 at 12:09

Here are three approaches using

scipy.spatial.KDTree
scipy.spatial.distance.cdist
a lookup table

On a pair of 32bit int vectors of lengths 100 and 200 they all give the same result; speedwise they compare as follows:

count_sim_kd 16.408800622448325 ms
count_sim_cd 12.41896384395659 ms
count_sim_lu 0.8755046688020229 ms

So at this problem size look up wins by a huge margin.

Code:

import numpy as np
from scipy.spatial import cKDTree as KDTree
from scipy.spatial.distance import cdist

l1 = np.random.randint(0,2**32,100)
l2 = np.random.randint(0,2**32,200)
threshold = 10/32

def hamming_distance(n1: int, n2: int) -> float:
    return bin(n1 ^ n2).count('1')/32.0

matches = 0

for n1 in l1:
    for n2 in l2:
        sim = 1 - hamming_distance(n1, n2)

        if sim >= threshold:
            matches += 1

def count_sim_kd(a,b,th):
    A,B = (KDTree(np.unpackbits(x[:,None].view(np.uint8),axis=1))
           for x in (a,b))
    return A.sparse_distance_matrix(B,max_distance=32-int(32*th),p=1).nnz

def count_sim_cd(a,b,th):
    A,B = (np.unpackbits(x[:,None].view(np.uint8),axis=1) for x in (a,b))
    return np.count_nonzero(cdist(A,B,"minkowski",p=1)<=32-int(32*th))


lu = sum(np.unravel_index(np.arange(256),8*(2,)))
def count_sim_lu(a,b,th):
    return np.count_nonzero(lu[(a[:,None,None]^b[None,:,None])
                               .view(np.uint8)].sum(2)<=32-int(32*th))

from timeit import timeit
for f in (count_sim_kd,count_sim_cd,count_sim_lu):
    assert f(l1,l2,threshold)==matches
    print(f.__name__,timeit(lambda:f(l1,l2,threshold),number=100)*10,'ms')

I'm starting to have memory issues. .view(np.uint8)].sum(2) <= 32 - int(32*th)) MemoryError: Unable to allocate 14.3 GiB for an array with shape (15488, 15488, 8) and data type int64 Would you recommend splicing `a` and `b` into halves? Or some other approach. Don't understand why numpy treats this as a int64 array. — Filip Allberg, Jan 27 '20 at 13:24
@FilipAllberg It's because the lu table uses `int64`. Could you pls check whether `lu = sum(map(np.uint8,np.unravel_index(np.arange(256),8*(2,))))` solves your problem? — Paul Panzer, Jan 27 '20 at 13:47
It did! I was just adding print(array.dtype) statements everywhere trying to figure this out! No out of memory error. Life-saver you! — Filip Allberg, Jan 27 '20 at 14:59

Python: scipy/numpy all pairs computation between two 1-D vectors

2 Answers2