I've been using spaCy's en_core_web_lg model and wanted to try out en_core_web_trf (the transformer model), but I'm having some trouble wrapping my head around the difference in model/pipeline usage.
My use case looks like the following:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")
s1.similarity(s2)
Output:
The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements.
(0.0, Space aliens lurk in the night time.)
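If I read the warning correctly, the trf pipeline ships with no static word vectors at all, so Doc.similarity falls back to the other pipeline components. A quick check seems to confirm this (the expected values are my assumption):

import spacy

nlp = spacy.load("en_core_web_trf")

# The transformer pipeline appears to have an empty static-vector table,
# which is what triggers the warning above.
print(nlp.vocab.vectors.shape)  # I expect (0, 0): no word vectors
print(len(nlp.vocab.vectors))   # I expect 0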
Looking at this post, the transformer model does not have word vectors in the same way en_core_web_lg does, but you can get the embeddings via s1._.trf_data.tensors, which look like:
s1._.trf_data.tensors[0].shape
(1, 9, 768)
s1._.trf_data.tensors[1].shape
(1, 768)
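As far as I can tell, tensors[0] holds the per-wordpiece embeddings and tensors[1] a pooled sentence-level representation. One idea I had was to mean-pool tensors[0] into a single 768-dim vector first (the pooling choice is my own assumption, not a documented API):

def doc_vector(doc):
    # Mean-pool the wordpiece embeddings, shape (1, n_wordpieces, 768),
    # down to a single 768-dim vector (assumes CPU/numpy arrays).
    return doc._.trf_data.tensors[0].mean(axis=1).squeeze()

v1 = doc_vector(s1)  # shape (768,)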
So I tried to compute the cosine similarity manually (using this post as a reference):
from scipy.spatial.distance import cosine

def similarity(obj1, obj2):
    (v1, t1), (v2, t2) = obj1._.trf_data.tensors, obj2._.trf_data.tensors
    try:
        return ((1 - cosine(v1, v2)) + (1 - cosine(t1, t2))) / 2
    except:
        return 0.0
But this does not work: scipy's cosine expects 1-D vectors, so both calls raise on these 2-D and 3-D arrays and the bare except always returns 0.0.
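For reference, squeezing tensors[1] down to 1-D before calling cosine does at least run for me, though whether this is a meaningful similarity measure is exactly what I'm unsure about:

from scipy.spatial.distance import cosine

def trf_similarity(doc1, doc2):
    # tensors[1] has shape (1, 768); squeeze to 1-D so scipy's
    # cosine (which requires 1-D input) accepts it.
    t1 = doc1._.trf_data.tensors[1].squeeze()
    t2 = doc2._.trf_data.tensors[1].squeeze()
    return 1 - cosine(t1, t2)

print(trf_similarity(s1, s2))

Is squeezing/pooling these tensors the right way to reproduce Doc.similarity with the transformer model, or is there a built-in approach I'm missing?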