
I have a set of 100 thousand vectors and I need to retrieve the top-25 closest vectors for each one based on cosine similarity.

Scipy and Sklearn have implementations for computing the cosine distance/similarity between 2 vectors, but I will need to compute the cosine similarity for a 100K x 100K matrix and then take out the top-25 for each row. Is there any fast implementation in Python to compute that?
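For reference, the per-pair routines I mean are along these lines (a minimal sketch; u and v stand in for two of my 400-dimensional vectors):

import numpy
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

u = numpy.random.rand(400)
v = numpy.random.rand(400)

dist = cosine(u, v)  # scipy returns the cosine *distance*, i.e. 1 - similarity
sim = cosine_similarity(u.reshape(1, -1), v.reshape(1, -1))  # sklearn expects 2-D inputs, returns a 1x1 array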

As per @Silmathoron's suggestion, this is what I am doing -

import heapq
import numpy

# vectors is a list of 100K vectors, each of dimension 400 (100K x 400)
vectors = numpy.array(vectors)
similarity = numpy.dot(vectors, vectors.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

k = 26  # 25 neighbours plus each vector's similarity with itself

# queries is a list of identifiers, one per vector
with open("box_data.csv", "w") as box_plot_file:
    for sim, query in zip(cosine, queries):
        k_largest = heapq.nlargest(k, sim)
        result = query + "," + ",".join(map(str, k_largest)) + "\n"
        box_plot_file.write(result)
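Side note: instead of heapq.nlargest per row, a vectorized top-k via numpy.argpartition may be faster. This is only a sketch, under the assumption that the full cosine matrix from above fits in memory (at 100K x 100K in float64 that is roughly 80 GB, so in practice the rows would need to be processed in chunks):

import numpy

k = 26  # 25 neighbours plus the vector itself

# argpartition places the k largest entries of each row (unordered) in the
# first k positions; sorting just those k entries then orders them
top_unsorted = numpy.argpartition(-cosine, k - 1, axis=1)[:, :k]
row_idx = numpy.arange(cosine.shape[0])[:, None]
order = numpy.argsort(-cosine[row_idx, top_unsorted], axis=1)
top_idx = top_unsorted[row_idx, order]  # indices of the k most similar vectors per row
top_sim = cosine[row_idx, top_idx]      # the corresponding cosine similarities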
silent_dev
  • What do you mean by the "top-25 closest vector"? The top 25 closest pairs? Or something else? – Mark Dickinson Jun 25 '16 at 14:53
  • For each vector, I will compute the cosine similarity with every other vector and select the 25 most similar vectors. – silent_dev Jun 25 '16 at 14:57
  • It depends how fast you want it... if you show us an example of your implementation with the time it takes (potentially on a subsample if it is really too slow), and tell us the desired speed increase, then we can tell you whether it can be accelerated with better algorithms in pure Python, or whether you need to go to Cython or multithreading... – Silmathoron Jun 25 '16 at 18:00
  • Lucene could be a good option. Assuming you are working on text data, you want to take advantage of the sparse nature of the data. An inverted index could help. – David Maust Jun 26 '16 at 06:14

1 Answer


I would try smarter algorithms first, rather than speeding up brute force (computing all pairs of vectors). KD-trees might work (scipy.spatial.KDTree) if your vectors are of low dimension. If they are high-dimensional, then you might need a random projection first: http://scikit-learn.org/stable/modules/random_projection.html
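A sketch of that idea, assuming the vectors are unit-normalized first: for unit vectors, Euclidean distance is monotone in cosine similarity (||u - v||^2 = 2 - 2*cos(u, v)), so a Euclidean nearest-neighbour query returns the cosine nearest neighbours. The n_components value and the use of cKDTree are my own assumptions, not part of the answer, and after a random projection the recovered similarities are only approximate:

import numpy
from scipy.spatial import cKDTree
from sklearn.random_projection import GaussianRandomProjection

vectors = numpy.random.rand(100000, 400)  # placeholder for the real 100K x 400 data

# reduce the 400 dimensions first, since KD-trees degrade in high dimensions
projected = GaussianRandomProjection(n_components=32).fit_transform(vectors)

# normalize rows so that Euclidean nearest neighbours == cosine nearest neighbours
unit = projected / numpy.linalg.norm(projected, axis=1, keepdims=True)

tree = cKDTree(unit)
dists, idx = tree.query(unit, k=26)  # each vector's 25 neighbours plus itself
approx_sim = 1 - dists**2 / 2        # recover (approximate) cosine similarity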

ericf