Given an array of sentence embeddings (vectors of length 512) with shape (1000000, 512), how do I calculate the cosine similarity of each of the 1 million sentence embeddings against every other embedding in the array, ideally using TensorFlow so I can speed it up with a GPU?
2 Answers
You can calculate the cosine distances this way:

import numpy as np
import tensorflow as tf

X = np.random.uniform(0, 10, (100, 512)).astype('float32')
X = tf.constant(X)

def compute_cosine_distances(a, b):
    # cosine distance = 1 - cosine similarity of the l2-normalized rows
    normalize_a = tf.nn.l2_normalize(a, 1)
    normalize_b = tf.nn.l2_normalize(b, 1)
    distance = 1 - tf.matmul(normalize_a, normalize_b, transpose_b=True)
    return distance

compute_cosine_distances(X, X)
which is equal to
from sklearn.metrics.pairwise import pairwise_distances
pairwise_distances(X.numpy(), metric='cosine')
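At the scale in the question, the full 1,000,000 × 1,000,000 float32 result would need roughly 4 TB, so in practice the matrix is computed in row batches and consumed (or written to disk) block by block. A minimal sketch of that batching, using the same normalize-then-matmul idea (the function and batch size here are my own illustration, not from the answer):

```python
import numpy as np
import tensorflow as tf

def batched_cosine_similarity(x, batch_size=256):
    """Yield (start, block) pairs, where block holds the cosine
    similarities of rows start:start+batch_size against all rows of x."""
    x = tf.nn.l2_normalize(tf.constant(x, dtype=tf.float32), axis=1)
    n = x.shape[0]
    for start in range(0, n, batch_size):
        # one (batch_size, n) slice of the full similarity matrix
        block = tf.matmul(x[start:start + batch_size], x, transpose_b=True)
        yield start, block.numpy()

# Small demo: 1000 rows instead of 1 million, assembled into one array.
X = np.random.uniform(0, 10, (1000, 512)).astype('float32')
sims = np.empty((1000, 1000), dtype='float32')
for start, block in batched_cosine_similarity(X):
    sims[start:start + len(block)] = block
```

At the real scale you would process or store each block instead of assembling `sims`; the matmul per block is what the GPU accelerates.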

Marco Cerliani
Cosine similarity is a metric used to measure how similar two documents are irrespective of their size: mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space, so identical vectors have similarity 1. Note that tf.keras.losses.CosineSimilarity returns the negative of the similarity (it is a loss to minimize), so comparing an array with itself will always give -1.
import tensorflow as tf
y_true = [[2., 8.], [1., 7.]]
y_pred = [[2., 8.], [1., 7.]]
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
# the loss is the negated cosine similarity
print(cosine_loss(y_true, y_pred).numpy())
output: -1.0000001
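As a quick check (my own sketch, not part of the answer): negating the Keras loss recovers the usual +1 similarity for identical vectors, and matches the dot product of the l2-normalized rows computed by hand.

```python
import numpy as np
import tensorflow as tf

y_true = np.array([[2., 8.], [1., 7.]])
y_pred = np.array([[2., 8.], [1., 7.]])

# Keras returns the *negative* mean cosine similarity (a loss to
# minimize), so identical vectors give -1, not +1.
loss = tf.keras.losses.CosineSimilarity(axis=1)(y_true, y_pred).numpy()
similarity = -loss  # ~1.0 for identical inputs

# Manual cosine similarity: dot product of the l2-normalized rows.
norm = lambda v: v / np.linalg.norm(v, axis=1, keepdims=True)
manual = np.sum(norm(y_true) * norm(y_pred), axis=1).mean()
```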

Varchita Lalwani
Sorry, maybe I didn't ask the question correctly. What I want is each element compared against every other element in the array: given sentence embeddings [a, b, c], I want to know how similar a is to b and c, how similar b is to a and c, and how similar c is to a and b – jdoig Jun 05 '20 at 08:22
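What the comment asks for is exactly the pairwise matrix from the first answer: row i, column j of compute_cosine_distances(X, X) is the distance between embedding i and embedding j. A tiny [a, b, c] illustration (the function is copied from the first answer so the snippet runs on its own; the toy vectors are my own):

```python
import numpy as np
import tensorflow as tf

def compute_cosine_distances(a, b):
    # same function as in the first answer
    normalize_a = tf.nn.l2_normalize(a, 1)
    normalize_b = tf.nn.l2_normalize(b, 1)
    return 1 - tf.matmul(normalize_a, normalize_b, transpose_b=True)

# Three toy "sentence embeddings" a, b, c.
emb = tf.constant([[1., 0., 0.],   # a
                   [0., 1., 0.],   # b
                   [1., 1., 0.]])  # c

# d[i, j] is the cosine distance between embedding i and embedding j:
# row 0 answers "how far is a from a, b and c", and so on.
d = compute_cosine_distances(emb, emb).numpy()
```

Here d[0, 1] is 1 (a and b are orthogonal) and d[0, 2] is 1 - 1/sqrt(2), so one call gives every element compared against every other.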