The following code uses spaCy word vectors to find the 20 words most similar to a given word. It first computes the cosine similarity between that word and every word in the vocabulary (more than a million), then sorts the whole list by similarity.
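from numpy import dot
from numpy.linalg import norm
from spacy.en import English  # assumed spaCy 1.x import path; it differs in later versions

word = "apple"  # placeholder query word (lowercase), chosen only for illustration
parser = English()
# access the query word in the parser's vocabulary
current_word = parser.vocab[word]
# cosine similarity between two vectors
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# gather all known words that have a vector, take only the lowercased versions,
# and skip the query word itself
allWords = [w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != word]
# sort by similarity to the query word, most similar first
allWords.sort(key=lambda w: cosine(w.vector, current_word.vector), reverse=True)
print("Top 20 most similar words to %s:" % word)
for w in allWords[:20]:
    print(w.orth_)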
What I would like to know is whether there is a way to restrict spaCy's vocabulary only to the words that occur in a given list, which I hope would hugely reduce the cost of the sort operation.
To be clear, I would like to pass in a list of just a few words, or just the words in a given text, and be able to rapidly look up which of these words are nearest each other in spaCy's vector space.
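To make the goal concrete, here is a rough sketch of the kind of lookup I have in mind. It is not tied to spaCy: `most_similar_within` and `vector_of` are names I made up for illustration, and the toy 2-d vectors stand in for real word vectors. The idea is to stack the vectors of only the listed words into one matrix and compute every pairwise cosine similarity in a single NumPy operation.

```python
import numpy as np

def most_similar_within(words, vector_of, top_n=3):
    """For each word, rank the other words in `words` by cosine similarity.

    `vector_of` is a hypothetical callable mapping a word to its vector.
    """
    # Stack one vector per word into an (n_words, dim) matrix.
    matrix = np.array([vector_of(w) for w in words], dtype=float)
    # Normalise the rows so that a dot product equals cosine similarity;
    # all-zero rows are left as zeros instead of dividing by zero.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T
    # Push each word's similarity with itself to the bottom of the ranking.
    np.fill_diagonal(sims, -np.inf)
    result = {}
    for i, w in enumerate(words):
        order = np.argsort(-sims[i])[:top_n]
        result[w] = [(words[j], float(sims[i, j])) for j in order if j != i]
    return result

# Toy 2-d vectors, made up for illustration only.
toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
neighbours = most_similar_within(list(toy), toy.__getitem__, top_n=2)
print(neighbours["cat"][0][0])
```

With spaCy, I assume `vector_of` could be something like `lambda w: parser.vocab[w].vector`, after first dropping any words for which `has_vector` is false. The cost then depends only on the length of the list, not on the size of the full vocabulary.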
Any help on this front would be appreciated.