Similarity in Spacy

Question

I am trying to understand how similarity in spaCy works. I compared Melania Trump's speech with Michelle Obama's speech to see how similar they were.

This is my code.

import spacy
nlp = spacy.load('en_core_web_lg')

# Read both speeches, silently dropping any non-ASCII characters
with open("melania.txt", encoding="ascii", errors="ignore") as f:
    file1 = f.read()
with open("michelle.txt", encoding="ascii", errors="ignore") as f:
    file2 = f.read()

doc1 = nlp(file1)
doc2 = nlp(file2)
print(doc1.similarity(doc2))

I get the similarity score as 0.9951584208511974. This similarity score looks very high to me. Is this correct? Am I doing something wrong?

Does this answer your question? [Spacy, Strange similarity between two sentences](https://stackoverflow.com/questions/52113939/spacy-strange-similarity-between-two-sentences) — Johannes Filter, Oct 23 '21 at 12:22

score 17 · Answer 1 · answered Nov 24 '18 at 08:47

By default spaCy calculates cosine similarity. Similarity is determined by comparing word vectors or word embeddings, multi-dimensional meaning representations of a word.

It returns numpy.dot(self.vector, other.vector) / (self_norm * other_norm), where self_norm and other_norm are the norms of the two vectors.

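import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')  # same model as in the question

text1 = 'How can I end violence?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy :", doc1.similarity(doc2))

# The same cosine similarity, computed by hand from the document vectors
print(np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector)))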

Output:

spaCy : 0.916553147896471
0.9165532

The vectors come from spaCy's .vector attribute, which for a whole document defaults to the average of its token vectors. The documentation says that the word vectors in spaCy's models are GloVe vectors.
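To make the two steps concrete (average the token vectors, then take the cosine), here is a minimal sketch in plain Python. The 3-dimensional vectors and the names cosine and mean_vector are made up for illustration; they are not spaCy's API, and real models use vectors with hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of token vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical toy "word vectors", two tokens per document.
doc_a_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b_tokens = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = mean_vector(doc_a_tokens)  # document vector = average of token vectors
doc_b = mean_vector(doc_b_tokens)

similarity = cosine(doc_a, doc_b)
print(similarity)  # roughly 0.5: the documents share one of their two words
```

The point is only that the document-level score depends on nothing but the two averaged vectors, so word order and sentence structure never enter into it.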

Thank you. I wanted to know why the score is so high. Any insights? — thehydrogen, Nov 24 '18 at 16:32

score 5 · Answer 2 · edited Apr 28 '21 at 16:02

spaCy's vector for a sentence or a document is just the average of the word vectors of its tokens, and the similarity is the cosine between two such averages. Hence, if 2 speeches (each made up of multiple sentences)

- have a lot of positive words
- are produced in similar circumstances
- use commonly used words

then the averaged vectors of the two speeches can end up very close together, and the similarity will be high. But if you do the same with just single short sentences, it still fails semantically.
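The effect of shared common words can be sketched with toy numbers. Everything below is hypothetical: the 4-dimensional vectors, the word choices, and the helper names are assumptions for illustration, not values from a spaCy model. Each "speech" repeats the same two common words five times and adds one topic word, and the two topic words are completely unrelated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise average of a list of token vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical vectors: two common words and two orthogonal topic words.
the = [1.0, 0.0, 0.0, 0.0]
and_ = [0.0, 1.0, 0.0, 0.0]
topic_a = [0.0, 0.0, 1.0, 0.0]
topic_b = [0.0, 0.0, 0.0, 1.0]

# Each speech: the common words repeated five times, plus one topic word.
speech_a = [the, and_] * 5 + [topic_a]
speech_b = [the, and_] * 5 + [topic_b]

topic_level = cosine(topic_a, topic_b)
doc_level = cosine(mean_vector(speech_a), mean_vector(speech_b))

print(topic_level)  # the topic words themselves have nothing in common
print(doc_level)    # the averaged speeches still look almost identical
```

With these numbers the topic words have a cosine of 0, yet the averaged speeches score about 0.98, because the shared words dominate both averages.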

For example, consider the two sentences below:

sentence 1: "This is about airplanes and airlines"

sentence 2: "This is not about airplanes and airlines"

Both sentences will give a high similarity score (0.989662) in spaCy despite meaning the opposite. It seems that the vector of "not" is not that different from those of the other words in the sentence, and its vector_norm is also similar.
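A rough sketch of why one extra token changes so little: "not" is a single vector among seven in the average. The 3-dimensional vectors below are invented for illustration (they are not taken from any spaCy model), and "not" is deliberately given a direction quite unlike the rest of the sentence.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sentence_vector(sentence, vectors):
    """Average the vectors of the whitespace-separated, lowercased tokens."""
    tokens = sentence.lower().split()
    rows = [vectors[t] for t in tokens]
    return [sum(component) / len(rows) for component in zip(*rows)]

# Hypothetical toy vectors; "not" points in a clearly different direction.
vectors = {
    "this": [0.9, 0.1, 0.2],
    "is": [0.8, 0.2, 0.1],
    "not": [0.1, 0.2, 0.9],
    "about": [0.6, 0.4, 0.2],
    "airplanes": [0.2, 0.9, 0.3],
    "and": [0.9, 0.2, 0.1],
    "airlines": [0.3, 0.8, 0.4],
}

plain = sentence_vector("This is about airplanes and airlines", vectors)
negated = sentence_vector("This is not about airplanes and airlines", vectors)

not_vs_sentence = cosine(vectors["not"], plain)
sentence_similarity = cosine(plain, negated)

print(not_vs_sentence)      # "not" on its own is far from the sentence average
print(sentence_similarity)  # yet the two sentence averages stay very close
```

In this toy setup "not" has a cosine below 0.5 with the sentence average, but the negated sentence still scores above 0.98 against the original, so averaging alone is enough to hide the negation.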
