I have around 1M binary numpy arrays and I need to compute the Hamming distance between them to find the k-nearest neighbours. The fastest method I've found is cdist, which returns a float matrix of distances.
Since I don't have enough memory to hold a 1Mx1M float matrix, I'm doing it one row at a time, like this:
from scipy.spatial import distance
hamming_distance = distance.cdist(array1, all_array, 'hamming')  # array1 is a single row, shape (1, n_bits)
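In full, the loop looks roughly like this (a minimal sketch; the small random array stands in for my real 1M rows, and argpartition is just one way to pull out the k nearest per row):

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
all_array = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # small stand-in for the real 1M rows
k = 5

neighbours = np.empty((all_array.shape[0], k), dtype=np.int64)
for i in range(all_array.shape[0]):
    row = all_array[i:i + 1]  # keep it 2-D for cdist
    d = distance.cdist(row, all_array, 'hamming')[0]
    neighbours[i] = np.argpartition(d, k)[:k]  # indices of the k smallest distances (includes i itself)

This keeps memory down to one row of distances at a time; it's the per-row cdist calls that are slow.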
The problem is that each call takes 2-3 s, so for 1M documents it takes an eternity (and I need to rerun it for different values of k).
Is there any faster way to do this?
I'm thinking about multiprocessing, or doing it in C, but I have trouble understanding how multiprocessing works in Python, and I don't know how to mix C code with Python code.
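For what it's worth, this is roughly the multiprocessing version I've been trying to sketch (chunking the rows and farming the cdist calls out to a Pool; the chunk count and Pool settings are guesses, and it still does the same total work):

import numpy as np
from multiprocessing import Pool
from scipy.spatial import distance

# small deterministic stand-in for the real 1M-row array (seeded, so
# worker processes re-create the same data under the spawn start method)
rng = np.random.default_rng(0)
all_array = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)
k = 5

def k_nearest(chunk):
    # distances from one chunk of rows to the whole set; the temporary
    # matrix is (len(chunk), len(all_array)) floats, so the chunk size
    # bounds the memory each worker needs
    d = distance.cdist(chunk, all_array, 'hamming')
    return np.argpartition(d, k, axis=1)[:, :k]  # k smallest per row, unsorted

if __name__ == '__main__':
    chunks = np.array_split(all_array, 10)
    with Pool() as pool:
        parts = pool.map(k_nearest, chunks)
    neighbours = np.vstack(parts)
    print(neighbours.shape)  # (1000, 5)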