
I'm trying to vectorize the answer from this question:

Fastest way to get hamming distance for integer array

import numpy as np

r = (1 << np.arange(64, dtype=np.uint64))[:, None]
def hamming_distance_v2(a, b):
    t = np.bitwise_xor(a, b)
    p = np.bitwise_and(t, r)
    return np.count_nonzero(p != 0)

I want to pass a 2D array as the first parameter, for example:

a = [[127,255], [127,255]]
b = [127,240]
hamming_distance_v2(a, b) -> [4,4]

If a 2D array is passed as the first argument, the following error is raised:

ValueError: unable to broadcast argument 1 to output array

Is there a way to vectorize the current implementation of Hamming distance, or some other way to compute this distance between a 2D array and a 1D array?

Alexander Karp

1 Answer

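Based on the answer in the linked Q&A, we can modify it to handle the extra dimension as shown next. Note that Approaches #1 and #2 compare only the 8 lowest bits of each value (np.arange(8)), which is enough for values below 256 as in the example.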

Approach #1

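def hamming_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)  # accept lists as well as arrays
    r = (1 << np.arange(8))[:,None]      # masks for the 8 lowest bits, shape (8, 1)
    mask = (a[:,None] & r) != (b & r)    # shape (n, 8, m): True where a bit differs
    return mask.sum((1,2))               # differing bits per row of a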

Approach #2

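def hamming_distance_v2(a, b):
    a, b = np.asarray(a), np.asarray(b)  # accept lists as well as arrays
    r = (1 << np.arange(8))[:,None]      # masks for the 8 lowest bits, shape (8, 1)
    xor = np.bitwise_xor(a[:,None],b)    # shape (n, 1, m): set bits mark differences
    mask = (xor & r) != 0                # shape (n, 8, m)
    return mask.sum((1,2))               # differing bits per row of a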

Approach #3

Another option uses np.unpackbits, which works on uint8 input -

def hamming_distance_v3(a, b):
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    m = np.unpackbits(a,axis=1) != np.unpackbits(b)
    return m.sum(1)

Sample run -

In [107]: a
Out[107]: 
array([[127, 255],
       [127, 205],
       [227, 255]])

In [108]: b
Out[108]: array([127, 240])

In [109]: hamming_distance(a, b)
Out[109]: array([4, 5, 8])
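
For values wider than 8 bits, one possible extension is to XOR the arrays as uint64 and reinterpret the result as bytes before unpacking. The following is only a sketch under that assumption; the name hamming_distance_u64 is made up for illustration and is not part of the approaches above.

```python
import numpy as np

def hamming_distance_u64(a, b):
    """Row-wise Hamming distance between a 2D array `a` and a 1D array `b`,
    treating every element as an unsigned 64-bit integer."""
    a = np.ascontiguousarray(a, dtype=np.uint64)
    b = np.ascontiguousarray(b, dtype=np.uint64)
    # b broadcasts over the rows of a; set bits in the result mark differences.
    xor = np.bitwise_xor(a, b)                  # shape (n, m)
    # Reinterpret each 64-bit word as 8 bytes so np.unpackbits can be used.
    as_bytes = xor.view(np.uint8)               # shape (n, 8 * m)
    # Expand the bytes to bits and count the set bits in every row.
    return np.unpackbits(as_bytes, axis=1).sum(axis=1)
```

With the a and b from the sample run this should again give 4, 5 and 8. Byte order does not matter here because only the number of set bits is counted.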
Divakar
  • Looping over the a array with the non-vectorized function is faster than using those vectorized functions. Could it be a problem with the length of the a array? – Alexander Karp Sep 21 '20 at 18:38
  • @AlexanderKarp Faster than all three of the posted approaches? – Divakar Sep 21 '20 at 18:42
  • Faster than hamming_distance and hamming_distance_v2. I can't use hamming_distance_v3 because np.unpackbits can't be used with uint64 (https://pastebin.com/S3rDJmr3). I got the best result with step = 1 (27 s); if the step is the length of the sliding array, the computation gets stuck. – Alexander Karp Sep 22 '20 at 07:33
  • @AlexanderKarp Can you share the loop version? – Divakar Sep 22 '20 at 09:59
  • Some clarification: I have a source array and a target array, and I want to find the position in the source array where the Hamming distance to the target array is the smallest possible. There are two approaches: a for loop with step=1, and splitting the source array into sliding windows to allow vectorization. Here are both of them: https://pastebin.com/fF9K6X1z – Alexander Karp Sep 22 '20 at 10:15
  • The loop version with numba on array **a** with shape (2095250,) and array **b** (250,) runs for 20 seconds; the vectorized version with windowed array **a** (2095001, 250) and **b** (250,) was killed by my OS with exit code 137 :) – Alexander Karp Sep 22 '20 at 13:14
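
A middle ground between the plain loop and the fully windowed array described in the comments might be to process the sliding windows in chunks, so that only a small block of XOR results exists in memory at any time. The sketch below is an assumption about how that could look, not the code from the pastebin links: best_window, POPCOUNT8 and chunk are hypothetical names, and sliding_window_view requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Lookup table: number of set bits for every possible byte value 0..255.
POPCOUNT8 = np.unpackbits(
    np.arange(256, dtype=np.uint8)[:, None], axis=1
).sum(axis=1).astype(np.uint8)

def best_window(source, target, chunk=4096):
    """Return (position, distance) of the window of `source` whose Hamming
    distance to `target` is smallest; the first minimum wins on ties."""
    source = np.ascontiguousarray(source, dtype=np.uint64)
    target = np.ascontiguousarray(target, dtype=np.uint64)
    # A strided view of shape (len(source) - len(target) + 1, len(target));
    # no window data is copied at this point.
    windows = sliding_window_view(source, target.size)
    best_pos, best_dist = -1, None
    for start in range(0, windows.shape[0], chunk):
        # Only this block of at most `chunk` windows is materialized.
        xor = np.bitwise_xor(windows[start:start + chunk], target)
        # View the 64-bit words as bytes and look up each byte's bit count.
        dist = POPCOUNT8[xor.view(np.uint8)].sum(axis=1, dtype=np.int64)
        i = int(dist.argmin())
        if best_dist is None or dist[i] < best_dist:
            best_pos, best_dist = start + i, int(dist[i])
    return best_pos, best_dist
```

Peak memory then depends on chunk times the target length instead of the full number of windows. Whether this beats the numba loop would have to be measured on the real data.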