I have been looking at this for the past hour but cannot find the problem. I have a list of articles and want to see which of them are similar to each other.

I did this by computing the cosine similarities between the TF-IDF vectors of the articles and making a t-SNE plot of the result. I did it in two ways, but to my surprise the plots are very different from each other, and I cannot tell which one is correct.

In the examples, tfdoc is the TF-IDF matrix.

from sklearn.metrics.pairwise import cosine_similarity
from sklearn import manifold

X = cosine_similarity(tfdoc, tfdoc)
model = manifold.TSNE(random_state=1, metric="precomputed")
Y = model.fit_transform(X) 

When plotted, this results in:

[Image: t-SNE plot from the precomputed cosine-similarity matrix]

But when I use this code:

from sklearn.manifold import TSNE

tsne = TSNE(random_state=1, metric="cosine")

embs = tsne.fit_transform(tfdoc)

It results in:

[Image: t-SNE plot from metric="cosine"]

Does anyone know what exactly the difference is?

Thanks in advance!!

HenkieTee

1 Answer

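The first snippet uses cosine similarity, whereas the second uses cosine distance. With metric="precomputed", TSNE expects X to be a distance matrix, but the first snippet passes it a similarity matrix, so the most similar articles are treated as the farthest apart. Cosine distance is 1 minus cosine similarity, so a larger cosine distance means a smaller cosine similarity. That is why the two plots differ, and the second one is the one that matches what you want.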
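If you want to keep the precomputed route, a minimal sketch of the fix could look like the following. It assumes a small random matrix as a stand-in for tfdoc (hypothetical data, not your articles), and the perplexity value is only chosen to suit that tiny sample. init="random" is set explicitly because recent scikit-learn versions do not allow PCA initialisation together with metric="precomputed".

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

# Stand-in for the TF-IDF matrix: 30 hypothetical "articles", 50 terms.
rng = np.random.default_rng(0)
tfdoc = rng.random((30, 50))

# Cosine distance is 1 - cosine similarity, so identical rows get 0, not 1.
S = cosine_similarity(tfdoc)
D = cosine_distances(tfdoc)
relation_holds = np.allclose(D, 1.0 - S)

# Pass the distance matrix (not the similarity matrix) as precomputed input.
# perplexity=5 is an assumption that fits the 30-sample toy data.
model = TSNE(random_state=1, metric="precomputed", init="random", perplexity=5)
Y = model.fit_transform(D)
```

Y then has one 2-D point per article. The layout may still not be identical to the metric="cosine" plot, since initialisation and other defaults can differ, but both now work from the same distances.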

James LI