Efficient way to compute cosine similarity between 1D array and all rows in a 2D array

Question

I have one 1D array of shape (300, ) and a 2D array of shape (400, 300). I want to compute the cosine similarity between each row of the 2D array and the 1D array. The result should therefore have shape (400, ), with one similarity score per row.

My initial idea is to iterate through the rows of the 2D array with a for loop and compute the cosine similarity one row at a time. Is there a faster alternative using broadcasting?

Here is a contrived example:

In [29]: vec = np.random.randn(300,)
In [30]: arr = np.random.randn(400, 300)

Below is the way I want to calculate the similarity between 1D arrays:

inn = (vec * arr[0]).sum()
vecnorm = np.sqrt((vec * vec).sum())
rownorm = np.sqrt((arr[0] * arr[0]).sum())
similarity_score = inn / vecnorm / rownorm

How can I generalize this so that arr[0] is replaced with the full 2D array?

How would your output be (300,)? if you have 400 vectors to "test against" then your output will be (400,), and a simple dot product will do... — Julien, Aug 28 '18 at 00:37
What's your cosine similarity calculation? You could give us a full working example with arrays like (4,3) and (3,) shapes. — hpaulj, Aug 28 '18 at 00:46
@hpaulj updated the question with these details. Please check! — kmario23, Aug 28 '18 at 00:52
For generalized solution for two 2D arrays see my other post https://stackoverflow.com/a/61643023/13484859 — milan.vancl, May 06 '20 at 18:58
See this discussion here https://codereview.stackexchange.com/questions/55717/efficient-numpy-cosine-distance-calculation — Manuel Alves, May 28 '21 at 13:43

score 4 · Answer 1 · edited Aug 28 '18 at 01:23

The numerator of the cosine similarity can be expressed as a matrix-vector product; the denominator then works out through broadcasting, since the row norms form a 1D array and the vector norm is a scalar :).

a_norm = np.linalg.norm(a, axis=1)
b_norm = np.linalg.norm(b)
(a @ b) / (a_norm * b_norm)

where a is the 2D array and b is the 1D array (i.e. the vector).
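
As a minimal sketch, the snippet above can be applied to the shapes from the question and checked against the row-by-row formula. The helper name `cosine_sim_rows` and the fixed seed are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np

def cosine_sim_rows(a, b):
    """Cosine similarity between each row of 2D `a` and 1D `b` (hypothetical helper)."""
    a_norm = np.linalg.norm(a, axis=1)   # one norm per row, shape (n_rows,)
    b_norm = np.linalg.norm(b)           # scalar
    return (a @ b) / (a_norm * b_norm)   # shape (n_rows,)

rng = np.random.default_rng(0)           # seed chosen arbitrarily
vec = rng.standard_normal(300)
arr = rng.standard_normal((400, 300))

scores = cosine_sim_rows(arr, vec)

# Reference: the formula from the question, applied one row at a time.
loop_scores = np.array([
    (vec * row).sum() / np.sqrt((vec * vec).sum()) / np.sqrt((row * row).sum())
    for row in arr
])

print(scores.shape)                      # (400,)
print(np.allclose(scores, loop_scores))  # True
```

A row (or vector) with zero norm would lead to a division by zero and a nan score, so this sketch assumes there are no all-zero rows.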

answered Aug 28 '18 at 00:54 by Bi Rico · edited Aug 28 '18 at 01:23 by kmario23

This approach is **10x** faster than the method of using `cdist` from scipy. – kmario23 Aug 28 '18 at 01:51

score 3 · Answer 2 · answered Aug 28 '18 at 00:53

You can use cdist:

import numpy as np
from scipy.spatial.distance import cdist


x = np.random.rand(1, 300)
Y = np.random.rand(400, 300)

similarities = 1 - cdist(x, Y, metric='cosine')
print(similarities.shape)

Output

(1, 400)

Note that cdist with metric='cosine' returns the cosine distance, that is, 1 - cosine similarity, so you need to convert the result.
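
Since the question asks for a result of shape (400,) and starts from a 1D vector, the cdist call might be adapted roughly as follows. This is only a sketch: the variable names and the seed are assumptions, and the cross-check uses the direct dot-product formula rather than anything from this answer.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)   # seed chosen arbitrarily
vec = rng.random(300)            # 1D query vector, shape (300,)
arr = rng.random((400, 300))     # 2D array, shape (400, 300)

# cdist needs two 2D inputs, so the 1D vector is viewed as a single row.
dist = cdist(vec[None, :], arr, metric='cosine')   # shape (1, 400)
similarities = 1.0 - dist.ravel()                   # shape (400,)

# Cross-check against the direct dot-product formula.
direct = (arr @ vec) / (np.linalg.norm(arr, axis=1) * np.linalg.norm(vec))
print(similarities.shape)                 # (400,)
print(np.allclose(similarities, direct))  # True
```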

Divakar · Accepted Answer · 2018-08-28T05:57:39.247

Here's one that follows the same approach as @Bi Rico's post, but uses einsum for the norm computations -

den = np.sqrt(np.einsum('ij,ij->i',arr,arr)*np.einsum('j,j',vec,vec))
out = arr.dot(vec) / den

Also, we can use vec.dot(vec) to replace np.einsum('j,j',vec,vec) for some marginal improvement.
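
Put together with the vec.dot(vec) variant, the approach might be wrapped up as in the sketch below. The function name `cosine_sim_einsum` and the seed are assumptions for illustration; the result is compared against the norm-based formula.

```python
import numpy as np

def cosine_sim_einsum(arr, vec):
    """Row-wise cosine similarity with einsum for the squared row norms (hypothetical helper)."""
    row_sq = np.einsum('ij,ij->i', arr, arr)   # squared norm of each row, shape (n_rows,)
    vec_sq = vec.dot(vec)                      # squared norm of vec, scalar
    return arr.dot(vec) / np.sqrt(row_sq * vec_sq)

rng = np.random.default_rng(0)                 # seed chosen arbitrarily
vec = rng.standard_normal(300)
arr = rng.standard_normal((400, 300))

out = cosine_sim_einsum(arr, vec)
ref = arr.dot(vec) / (np.linalg.norm(arr, axis=1) * np.linalg.norm(vec))
print(out.shape)              # (400,)
print(np.allclose(out, ref))  # True
```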

Timings -

In [45]: vec = np.random.randn(300,)
    ...: arr = np.random.randn(400, 300)

# @Bi Rico's soln with norm
In [46]: %timeit (np.linalg.norm(arr, axis=1) * np.linalg.norm(vec))
10000 loops, best of 3: 100 µs per loop

In [47]: %timeit np.sqrt(np.einsum('ij,ij->i',arr,arr)*np.einsum('j,j',vec,vec))
10000 loops, best of 3: 77.4 µs per loop

On bigger arrays -

In [48]: vec = np.random.randn(3000,)
    ...: arr = np.random.randn(4000, 3000)

In [49]: %timeit (np.linalg.norm(arr, axis=1) * np.linalg.norm(vec))
10 loops, best of 3: 22.2 ms per loop

In [50]: %timeit np.sqrt(np.einsum('ij,ij->i',arr,arr)*np.einsum('j,j',vec,vec))
100 loops, best of 3: 8.18 ms per loop
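
These numbers depend on the machine and the NumPy build, so they are best re-measured locally. Outside IPython, a comparison along these lines could be run with the standard timeit module; the helper `time_denominators` and its parameters are assumptions for illustration, and no particular outcome is implied.

```python
import timeit
import numpy as np

def time_denominators(n_rows, n_cols, repeat=3, number=20):
    """Best per-call time in seconds for the two denominator computations (hypothetical helper)."""
    rng = np.random.default_rng(0)
    vec = rng.standard_normal(n_cols)
    arr = rng.standard_normal((n_rows, n_cols))

    candidates = {
        'norm': lambda: np.linalg.norm(arr, axis=1) * np.linalg.norm(vec),
        'einsum': lambda: np.sqrt(np.einsum('ij,ij->i', arr, arr) * np.einsum('j,j', vec, vec)),
    }
    results = {}
    for name, fn in candidates.items():
        # Keep the best of several repeats, as %timeit does.
        results[name] = min(timeit.repeat(fn, repeat=repeat, number=number)) / number
    return results

timings = time_denominators(400, 300)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```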
