Tensorboard embedding visualization: what is cosine distance?

Question

I'm a PhD student in digital humanities, and I'm quite new to programming languages.

I have a problem that has been driving me crazy for the past month. I'm trying to visualize a doc2vec model (Python, gensim library) in the Embedding Projector in TensorBoard, but I'm not getting what I expect.

I'm sure that I'm missing something really basic here. To sum up:

If I pick a random vector in TensorBoard, its most similar vectors are completely different from the ones in my model. Is that because of the dimensionality reduction, or something else?
A lot of vectors have a cosine similarity higher than one, and I really don't understand what I'm doing wrong. Someone told me that maybe my vectors are not normalized, but I think Gensim already does that, doesn't it?

Here is the code I'm using to generate the embeddings. I also tried changing the code a bit, taking the vectors directly from "KeyedVectors", but nothing changed.

from gensim.scripts import word2vec2tensor
from gensim.models.doc2vec import Doc2Vec
doc2vec_model = Doc2Vec.load("doc2vec4.d2v")
doc2vec_model.save_word2vec_format('doc_tensor.w2v', doctag_vec=True, word_vec=False)
%run "C:..word2vec2tensor.py" -i doc_tensor.w2v -o my_plot

What am I doing wrong here? Thanks in advance.

Hi, can you add your code for the visualization and the computation of cosine similarity? – Joseph Budin, Jun 28 '19 at 09:56
Hi, I'm using the online version of TensorBoard (http://projector.tensorflow.org). I don't know how to access the code there; I'm just using it through its GUI. – Leonardo Sanna, Jun 28 '19 at 10:17

score 2 · Accepted Answer · answered Jun 28 '19 at 10:34

Cosine distance is defined as 1 - cosine_similarity. Since cosine_similarity lies in the interval [-1, 1], cosine_distance lies in [0, 2]. It is therefore normal that some distances are higher than 1: this happens whenever the cosine similarity is negative, that is, when the angle between the two vectors is greater than 90 degrees.
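To make the two ranges concrete, here is a minimal sketch using NumPy. The helper names `cosine_similarity` and `cosine_distance` are illustrative only (they are not taken from gensim or TensorBoard), and the toy vectors are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; always in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """1 - cosine similarity; always in [0, 2]."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors (illustrative, not from the doc2vec model in the question).
same_direction = cosine_distance([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])       # 90 degrees apart
opposite = cosine_distance([1, 0], [-1, 0])        # 180 degrees apart

print(same_direction, orthogonal, opposite)
```

The three cases land near 0, 1 and 2 respectively, so a distance above 1 only says that the similarity is negative, not that the vectors are unnormalized.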

As for your first question: in your link, the explained variance of the PCA is ~8.5%, so it is likely that the dimensionality reduction changes the neighbours of a given vector. You may want to try reducing the dimension in your model too. Without more information on what your model is, it is hard to be more specific.
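One way to check this effect yourself is to compare the nearest neighbours of a vector before and after a PCA projection. The sketch below is only an illustration under assumptions: it uses random vectors instead of your doc2vec vectors, a hand-rolled PCA via `np.linalg.svd`, and hypothetical helper names (`pca_project`, `nearest_by_cosine`). It does not reproduce what the projector does internally.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first principal components.

    Returns the projected data and the fraction of variance kept.
    """
    Xc = X - X.mean(axis=0)                      # centre the data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)        # variance ratio per component
    return Xc @ Vt[:n_components].T, float(explained[:n_components].sum())

def nearest_by_cosine(X, idx, k):
    """Indices of the k rows of X most cosine-similar to row idx."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf                          # exclude the query itself
    return set(np.argsort(-sims)[:k].tolist())

# Stand-in data: 200 random 100-dimensional vectors (an assumption,
# replace with your own document vectors).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 100))

projected, explained = pca_project(vectors, 3)
full_nn = nearest_by_cosine(vectors, 0, 10)
reduced_nn = nearest_by_cosine(projected, 0, 10)
overlap = len(full_nn & reduced_nn) / 10

print(f"variance kept: {explained:.1%}, neighbour overlap: {overlap:.0%}")
```

When the kept variance is small, the overlap between the two neighbour sets is typically low, which is the kind of mismatch described in the question.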

Joseph Budin

This is already VERY helpful. Thanks! Wasn't aware that cosine distance and similarity had two different ranges. – Leonardo Sanna Jun 28 '19 at 10:39
Thanks, you may want to accept this as an answer :) – Joseph Budin Jun 28 '19 at 12:04
