How can I calculate pairwise cosine similarity across multiple vectors in Python?

Question

To keep it simple, say I have four vectors -- W, X, Y, Z -- each containing the same number of values. I'm trying to calculate cosine similarity across them pairwise in Python, but I can't seem to get the right answer.

If I try comparing W vs. X:

print(np.dot(W, X.T)/(np.linalg.norm(W)*np.linalg.norm(X)))

I get the following result:

[[0.9984622004973391]]

If I compare W vs. Y I get:

[[0.8891911653057049]]

And if I compare W to Z I get:

[[0.9676746591879851]]

Of course, I don't want to do these manually one by one, as in reality I have many vectors.

When I try to calculate all three (X, Y, Z) vs. W at once:

V = pd.concat([X, Y, Z])
print(np.dot(W, V.T)/(np.linalg.norm(W)*np.linalg.norm(V)))

I get the following:

[[0.9982175434442747 0.005561082504669956 0.020547860729214433]]

...where the first nearly matches what I got when running them individually (but still not quite), while the others are way off.

There must be an issue with my approach to the all-at-once version, but I haven't been able to figure out how to fix it. Any ideas? Thanks!

score 1 · Answer 1 · answered Jun 09 '23 at 15:30

I found np.dot to be a bit inflexible, so I opted to use an element-wise multiplication with a sum along the correct axis. For the norm of v, you also have to specify the axis or it'll just calculate the norm of the matrix.

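import numpy as np

# Four random vectors of length 3
w, x, y, z = np.random.random((4, 3))

# Stack x, y, z as the rows of a matrix
v = np.array([x, y, z])

# Row-wise dot product with w, divided by the norm of w and the norm of each row
cos = np.sum(w * v, axis=1) / np.linalg.norm(w) / np.linalg.norm(v, axis=1)

# Compare with np.isclose rather than ==, since floating-point results can differ in the last bits
assert np.isclose(cos[0], np.dot(w, x.T) / np.linalg.norm(w) / np.linalg.norm(x))
assert np.isclose(cos[1], np.dot(w, y.T) / np.linalg.norm(w) / np.linalg.norm(y))
assert np.isclose(cos[2], np.dot(w, z.T) / np.linalg.norm(w) / np.linalg.norm(z))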

William Hideki Nakata · Accepted Answer · 2023-06-09T16:29:32.970

When you execute np.dot(W, V.T), you get three values, one per vector, like

[[3.9353 2.4442 2.418 ]]

Each of these values needs its own normalization (one for X, one for Y, one for Z), but np.linalg.norm(V) returns just one value: the norm of the whole matrix V. To calculate the norm of each vector (one per row), you must add the parameter axis=1.
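To make the difference concrete, here is a minimal sketch with made-up numbers (the matrix V below is hypothetical, not the asker's data). Without axis, np.linalg.norm on a 2-D array returns a single scalar (the Frobenius norm of the whole matrix); with axis=1 it returns one Euclidean norm per row.

```python
import numpy as np

# Hypothetical 3x2 matrix; each row stands in for one vector (X, Y, Z).
V = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# No axis: one scalar for the whole matrix, sqrt(9 + 16 + 36 + 64 + 0 + 25) = sqrt(150).
whole_norm = np.linalg.norm(V)

# axis=1: one norm per row, here 5, 10 and 5.
row_norms = np.linalg.norm(V, axis=1)

print(whole_norm)
print(row_norms)
```

Dividing all three dot products by the single whole-matrix norm is what produces the wrong values in the question.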

Finally, the correct and short code looks like this:

V = np.concatenate([X, Y, Z])
cos_sim = (W @ V.T)/(np.linalg.norm(W)*np.linalg.norm(V, axis=1))
print(cos_sim)
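If you need every vector compared against every other one, not just W against the rest, the same idea extends naturally. The following is only a sketch under assumptions: the helper name pairwise_cosine and the random test data are mine, not from the question, and it assumes no row has zero norm.

```python
import numpy as np

def pairwise_cosine(M):
    """Cosine similarity between all pairs of rows of M (hypothetical helper)."""
    M = np.asarray(M, dtype=float)
    # One norm per row, kept as a column of shape (n, 1) so it broadcasts across each row.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms      # scale every row to unit length (assumes no zero-norm rows)
    return unit @ unit.T  # entry [i, j] is the cosine similarity of rows i and j

# Illustrative data: four random vectors of length 3, stacked as rows.
rng = np.random.default_rng(0)
M = rng.random((4, 3))
sim = pairwise_cosine(M)
print(sim)
```

The result is a symmetric matrix with ones on the diagonal. If scikit-learn is available, sklearn.metrics.pairwise.cosine_similarity computes the same kind of matrix.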
