
Given an array of sentence embeddings (vectors of length 512) with shape (1000000, 512), how do I calculate the cosine similarity of each of the 1 million sentence embeddings against every other embedding in the array, ideally using TensorFlow so I can try to speed it up with a GPU?

jdoig

2 Answers

You can calculate the pairwise cosine distances this way:

import numpy as np
import tensorflow as tf

X = np.random.uniform(0, 10, (100, 512)).astype('float32')
X = tf.constant(X)

def compute_cosine_distances(a, b):
    # L2-normalize each row so the dot products become cosine similarities
    normalize_a = tf.nn.l2_normalize(a, 1)
    normalize_b = tf.nn.l2_normalize(b, 1)
    # pairwise cosine distance = 1 - cosine similarity
    distance = 1 - tf.matmul(normalize_a, normalize_b, transpose_b=True)
    return distance

compute_cosine_distances(X, X)

which gives the same result as scikit-learn:

from sklearn.metrics.pairwise import pairwise_distances

pairwise_distances(X.numpy(), metric='cosine')
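
At the full (1000000, 512) scale, note that the complete distance matrix has 10^12 entries (about 4 TB in float32), so it can't be materialized in one go. A minimal sketch of one workaround, computing the matrix in row chunks and reducing each block before moving on (the function name chunked_cosine_distances and the chunk_size parameter are my own, not from the answer):

import numpy as np
import tensorflow as tf

def chunked_cosine_distances(x, chunk_size=10000):
    # normalize all rows once up front
    x = tf.nn.l2_normalize(x, 1)
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        # one (chunk_size, n) block of the full distance matrix
        yield 1 - tf.matmul(chunk, x, transpose_b=True)

X = tf.constant(np.random.uniform(0, 10, (100000, 512)).astype('float32'))
for block in chunked_cosine_distances(X):
    # reduce each block before the next, e.g. tf.math.top_k(-block, k=5)
    # to keep only the nearest neighbours of each row
    pass

Each block is small enough to fit in GPU memory, and normalizing once up front avoids repeating that work per chunk.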

Marco Cerliani
Cosine similarity is a metric used to measure how similar two documents (or embeddings) are, irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Note that tf.keras.losses.CosineSimilarity returns the negative of the cosine similarity (so that minimizing the loss maximizes the similarity), which is why comparing an array with itself always gives -1:
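
In symbols, for vectors a and b (the standard definition, with the negation the Keras loss applies):

similarity(a, b) = (a · b) / (‖a‖ * ‖b‖)
loss(a, b) = -similarity(a, b)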

import tensorflow as tf

# identical rows, so each row-wise cosine similarity is 1 and the loss is -1
y_true = [[2., 8.], [1., 7.]]
y_pred = [[2., 8.], [1., 7.]]
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
print(cosine_loss(y_true, y_pred).numpy())

output: -1.0000001
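
This loss compares y_true and y_pred row by row, producing one value per pair of aligned rows, so it does not by itself give the all-against-all comparison the question asks for. A hedged sketch of how broadcasting with the related tf.keras.losses.cosine_similarity function could produce the full pairwise matrix (the variable names are my own):

import tensorflow as tf

emb = tf.constant([[2., 8.], [1., 7.], [3., 5.]])
# broadcast (3, 1, 2) against (1, 3, 2) to compare every row with every other;
# negate because the loss function returns the negative cosine similarity
pairwise = -tf.keras.losses.cosine_similarity(
    emb[:, None, :], emb[None, :, :], axis=-1)
print(pairwise.numpy())  # (3, 3) matrix of pairwise cosine similarities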

    Sorry, maybe I didn't ask the question correctly. What I want is each element compared against every other element in the array. So, given sentence embeddings [a, b, c], I want to know how similar a is to b & c, how similar b is to a & c, and how similar c is to a & b. – jdoig Jun 05 '20 at 08:22