
Given two matrices X1 (N, 3136) and X2 (M, 3136), where every element in every row is a binary number, I am trying to calculate the Hamming distance so that each row in X1 is compared to all of the rows from X2, such that the result matrix is (N, M).

I have written two functions for it (the first one with the help of numpy and the other one without numpy):

import numpy as np

def hamming_distance(X, X_train):
    # XOR each row of X with every row of X_train, then sum the
    # mismatches along each row to get one row of distances at a time
    array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])

    return array



def hamming_distance2(X, X_train):
    a = X.shape[0]  # number of rows in X
    b = X_train.shape[0]  # number of rows in X_train

    hamming_distance = np.zeros(shape=(a, b))

    for i in range(0, a):
        for j in range(0, b):
            # count the positions where the two rows differ
            hamming_distance[i, j] = np.count_nonzero(X[i, :] != X_train[j, :])

    return hamming_distance

My problem is that the upper function is much slower than the lower one, where I use two for loops. Is it possible to improve the first function so that it uses only one loop?

PS. Sorry for my English, it isn't my first language, although I was trying to do my best!

Derik Daro
  • Hamming distance is already available in scipy. Maybe use it or learn from it. – sascha May 30 '17 at 20:45
  • I will need to upload it on server as my final project to predict handwritten characters. We are only allowed to use numpy and Python 3.6 . – Derik Daro May 30 '17 at 20:46
  • [So learn from it](https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py#L547) (`(u != v).mean()`). – sascha May 30 '17 at 20:46
  • `(u != v).mean()` wouldn't make it much more efficient than the current version as the loop is still necessary. How big is N and M? Maybe calculate all pairwise comparisons with broadcasting and slice only the ones that are necessary. It may easily fill up your memory though. – ayhan May 30 '17 at 20:52
  • N is 2500 and M is about 6000. The maximum time to run the program is 1 minute, so I would like to improve the Hamming distance computation in order to save time and pass it a much larger training set. – Derik Daro May 30 '17 at 20:55
  • Note, in the first example you are building a list, then converting the final result of the list comprehension into an array, so it is no surprise it is taking longer than the second, which builds the array directly. Essentially, what you want takes two nested loops regardless, in the first function that nested loop is merely hidden from you. – juanpa.arrivillaga May 30 '17 at 21:19
  • This just goes to show that numpy is not magic fairy dust; it only helps with performance if you use it appropriately. And your second one uses numpy too. – Andras Deak -- Слава Україні May 30 '17 at 21:26
  • My failed attempt was to use the dot product because of the similarities of the two operations. Given two binary matrices, if you multiply one of them with (1-other), it gives the number of different entries where the first one is 1 and the other one is zero. So `(1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)` actually gives the Hamming distance. Was it faster than loops though? Unfortunately no. :) – ayhan May 30 '17 at 21:56
  • Time this with arrays of various sizes. For smaller ones `hamming_distance` IS faster, but with size its relative speed declines. And the full broadcasting version starts to run into memory size errors. – hpaulj May 31 '17 at 00:28
  • @ayhan To extract the best performance, we need to leverage BLAS with float values. Post that dot-based approach with float32-converted arrays as an answer. It wins by a big margin. – Divakar May 31 '17 at 03:06
  • @Divakar wow I didn't know that integer vs float would make that much difference (88x for a big array). Let me post that. Thank you. – ayhan May 31 '17 at 04:42

2 Answers


Numpy only makes your code much faster if you use it to vectorize your work. In your case you can make use of array broadcasting to vectorize your problem: compare your two arrays and create an auxiliary array of shape (N, M, K) (here K = 3136), which you can then sum along its third dimension:

hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)

We inject a singleton dimension into the first array to make its shape (N, 1, K); the second array is implicitly compatible with shape (1, M, K), so the operation can be performed.
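As a quick sanity check (a small sketch; `hamming_distance2` here is the double-loop version from the question), the broadcasted result matches the loops on small random inputs:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randint(0, 2, (5, 16))        # 5 binary rows of length 16
X_train = rng.randint(0, 2, (7, 16))  # 7 binary rows of length 16

broadcasted = (X[:, None, :] != X_train).sum(axis=-1)  # shape (5, 7)
looped = hamming_distance2(X, X_train)
print(np.array_equal(broadcasted, looped))  # True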

In the comments @ayhan noted that this will create a huge auxiliary array for large M and N, which is quite true. This is the price of vectorization: you gain CPU time at the cost of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you have to reduce the scope of your vectorization and loop over either M or N (or both; that would be your current approach); see the chunked sketch below. But this doesn't concern numpy itself; it is about striking a balance between available resources and performance.
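One possible middle ground is to broadcast over blocks of rows at a time, so the temporary array never exceeds (chunk, M, K). A minimal sketch (the function name and chunk size are my own, not from the thread):

import numpy as np

def hamming_distance_chunked(X, X_train, chunk=256):
    # Broadcasted Hamming distance, computed block by block so the
    # temporary boolean array is (chunk, M, K) instead of (N, M, K).
    N, M = X.shape[0], X_train.shape[0]
    out = np.empty((N, M), dtype=np.intp)
    for start in range(0, N, chunk):
        stop = min(start + chunk, N)
        out[start:stop] = (X[start:stop, None, :] != X_train).sum(axis=-1)
    return out

Larger chunk values trade more memory for fewer Python-level iterations; tune it to whatever your machine can hold.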


What you are doing is very similar to a dot product. Consider these two binary arrays:

1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1

We are trying to find the number of different pairs. If you directly take the dot product, it gives you the number of (1, 1) pairs. However, if you negate one of them, it will count the differing ones instead. For example, a1.dot(1-a2) counts the (1, 0) pairs. Since we also need the number of (0, 1) pairs, we add a2.dot(1-a1) to that. The good thing about the dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
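To make the decomposition concrete, here are the two example rows above checked in numpy (a small sketch):

import numpy as np

a1 = np.array([1, 0, 1, 0, 1, 1, 0, 0])
a2 = np.array([0, 0, 1, 1, 0, 1, 0, 1])

ones_zeros = a1.dot(1 - a2)  # positions where a1 == 1 and a2 == 0 -> 2
zeros_ones = a2.dot(1 - a1)  # positions where a1 == 0 and a2 == 1 -> 2
print(ones_zeros + zeros_ones)     # 4
print(np.count_nonzero(a1 != a2))  # 4, the Hamming distance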

Here's a demo:

prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))  # random binary matrices
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)  # loop version from the question

# float32 so the dot products are dispatched to fast BLAS routines
arr1 = arr1.astype('float32')
arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)

np.allclose(res1, res2)
Out: True

And timings:

%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop

%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop

%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop
ayhan