
I have around 1M binary numpy arrays between which I need to compute the Hamming distance to find the k nearest neighbours; the fastest method I have found is cdist, which returns a float matrix of distances.

Since I don't have enough memory for a 1M x 1M float matrix, I'm doing it one row at a time, like this:

from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1, all_array, 'hamming')

The problem is that each Hamming_Distance call takes about 2-3 s, so for 1M documents it takes an eternity (and I need to run it for different values of k).
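For context, the full per-row loop I have in mind is roughly the sketch below; the knn_bruteforce name and the argpartition step for keeping only the k smallest distances are just illustrative, not code I have benchmarked.

from scipy.spatial import distance
import numpy as np

def knn_bruteforce(all_array, k):
    # all_array: (n_samples, n_bits) binary matrix; k: number of neighbours wanted.
    n = all_array.shape[0]
    neighbours = np.empty((n, k), dtype=np.intp)
    for i in range(n):
        # One row against everything, as in the snippet above.
        d = distance.cdist(all_array[i:i + 1], all_array, 'hamming')[0]
        # argpartition avoids a full sort when only the k smallest are needed
        # (note: the k smallest include the row itself, at distance 0).
        neighbours[i] = np.argpartition(d, k)[:k]
    return neighbours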

Is there any faster way to do it?

I'm thinking about multiprocessing or writing it in C, but I have trouble understanding how multiprocessing works in Python and I don't know how to mix C code with Python code.
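For the multiprocessing idea, a minimal sketch of what splitting the rows into chunks could look like (the chunk count, the placeholder data and the chunk_distances helper are illustrative assumptions, not tested code):

from multiprocessing import Pool
from scipy.spatial import distance
import numpy as np

def chunk_distances(args):
    # Hypothetical helper: distances from one chunk of rows to the full array.
    chunk, all_array = args
    return distance.cdist(chunk, all_array, 'hamming')

if __name__ == '__main__':
    all_array = np.random.randint(0, 2, size=(1000, 64))  # placeholder data
    chunks = np.array_split(all_array, 4)                 # one chunk per worker
    with Pool(processes=4) as pool:
        dists = pool.map(chunk_distances, [(c, all_array) for c in chunks])
    # dists is a list of (chunk_size, n_samples) distance blocks; for 1M rows
    # the top-k selection would still have to happen inside each worker.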

jevanio
  • You're trying to brute-force a problem you don't have anywhere near the resources to brute-force. There are much better ways to find nearest neighbors than by computing all pairwise distances and taking the low ones. – user2357112 Nov 22 '16 at 03:58

1 Answer


If you want to compute the k-nearest neighbors, it may not be necessary to compute all n^2 pairs of distances. Instead, you can use a Kd tree or a ball tree (both are data structures for efficiently querying relations between a set of points).

Scipy has a package called scipy.spatial.kdtree; however, it does not currently support Hamming distance as a metric between points. The wonderful folks at scikit-learn (aka sklearn) do have an implementation of a ball tree with Hamming distance supported. Here's a small example using sklearn's ball tree.

from sklearn.neighbors import BallTree
import numpy as np

# Generate random binary data (10 vectors of 10 bits each).
data = np.random.randint(0, 2, size=(10, 10))

# Build the ball tree with Hamming distance as the metric.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)

print(neighbors)  # Row n has the nth vector's k closest neighbors.
print(distances)  # Same idea, but the Hamming distance to those neighbors.

Now for the big caveat. For high dimensional vectors, KDTree and BallTree become comparable to the brute force algorithm. I'm a bit unclear on the nature of your vectors, but hopefully the above snippet gives you some ideas/direction.
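If the tree query ends up no faster than brute force, another trick that is sometimes worth trying for binary data is to pack the bits into bytes and count differing bits with XOR plus a popcount lookup table; the sketch below is illustrative, and hamming_topk is a made-up name rather than a library function.

import numpy as np

# 256-entry lookup table: number of set bits in each possible byte value.
popcount = np.array([bin(i).count('1') for i in range(256)], dtype=np.uint8)

def hamming_topk(data, k):
    # Pack each binary row into bytes so a whole row fits in n_bits/8 uint8s.
    packed = np.packbits(data.astype(np.uint8), axis=1)
    neighbors = np.empty((data.shape[0], k), dtype=np.intp)
    for i in range(packed.shape[0]):
        # XOR one packed row against all rows, then sum the set bits per row.
        d = popcount[np.bitwise_xor(packed[i], packed)].sum(axis=1)
        neighbors[i] = np.argpartition(d, k)[:k]  # k smallest (includes row i itself)
    return neighbors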

Mark Hannel
  • BallTree can query k neighbours and also within a radius r, that's great. I'll check how much time it saves, but it's already a much better solution than mine, thanks xD – jevanio Nov 22 '16 at 17:16
  • It turns out to take a little more time than exhaustive search -.- – jevanio Dec 09 '16 at 21:23