
The following code uses spaCy word vectors to find the 20 words most similar to a given word. It first computes cosine similarity against every word in the vocabulary (more than a million entries), then sorts that list by similarity.

from numpy import dot
from numpy.linalg import norm
from spacy.en import English

parser = English()

word = "dog"   # example query word

# look up the query word in the parser's vocabulary
current_word = parser.vocab[word]

# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

# gather all known words that have a vector, keeping only lowercased forms
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != word})

# sort by similarity, most similar first
allWords.sort(key=lambda w: cosine(w.vector, current_word.vector), reverse=True)

print("Top 20 most similar words to %s:" % word)
for w in allWords[:20]:
    print(w.orth_)

What I would like to know is whether there is a way to restrict spaCy's vocabulary to only the words that occur in a given list, which I hope would hugely reduce the cost of computing and sorting the similarities.

To be clear, I would like to pass in a list of just a few words, or just the words in a given text, and be able to rapidly look up which of these words are nearest each other in spaCy's vector space.
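To make the goal concrete, here is a minimal NumPy sketch of the restriction I have in mind: score only a small candidate list instead of the whole vocabulary. The tiny hand-made vectors below are placeholders for spaCy's real ones (in practice, `parser.vocab[w].vector`):

```python
import numpy as np

# toy stand-ins for spaCy word vectors (in practice: parser.vocab[w].vector)
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
    "pear":  np.array([0.15, 0.25, 0.85]),
}

def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def nearest(word, candidates, n=20):
    """Rank only the given candidates by cosine similarity to `word`."""
    query = vectors[word]
    scored = sorted(((cosine(vectors[c], query), c) for c in candidates),
                    reverse=True)
    return [c for _, c in scored[:n]]

print(nearest("king", ["queen", "apple", "pear"]))
```

The sort now runs over a handful of candidates rather than a million vocabulary entries.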

Any help on this front appreciated.

jabrew

2 Answers


The spaCy documentation says:

The default English model installs vectors for one million vocabulary entries, using the 300-dimensional vectors trained on the Common Crawl corpus using the GloVe algorithm. The GloVe common crawl vectors have become a de facto standard for practical NLP.

So you could just load the GloVe vectors using Gensim. I'm not sure if you can load them directly, or if you have to use this script.

If you have loaded the word vectors in Gensim as model, you can simply use model.similarity('woman', 'man') to get the similarity between two words. If you have a list of candidate words, you could do something like:

def most_similar(word, candidates, model, n=20):
    """Get the n most similar words from a list of candidates."""
    similarities = [(model.similarity(word, candidate), candidate)
                    for candidate in candidates]
    most_similar_words = sorted(similarities, reverse=True)[:n]
    only_words = [w for sim, w in most_similar_words]
    return only_words
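For a quick check without downloading any vectors, here is a hedged usage sketch. DummyModel is a made-up stand-in exposing the same similarity(w1, w2) interface as a Gensim model, so the call shape carries over unchanged:

```python
import numpy as np

class DummyModel:
    """Made-up stand-in for a Gensim model: exposes similarity(w1, w2)."""
    def __init__(self, vectors):
        self.vectors = vectors

    def similarity(self, w1, w2):
        v1, v2 = self.vectors[w1], self.vectors[w2]
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def most_similar(word, candidates, model, n=20):
    """Get the n most similar words from a list of candidates."""
    similarities = [(model.similarity(word, candidate), candidate)
                    for candidate in candidates]
    most_similar_words = sorted(similarities, reverse=True)[:n]
    return [w for sim, w in most_similar_words]

model = DummyModel({
    "man":   np.array([1.0, 0.1]),
    "woman": np.array([0.9, 0.2]),
    "car":   np.array([0.1, 1.0]),
})
print(most_similar("man", ["woman", "car"], model, n=2))
```

With a real Gensim model in place of DummyModel, the function works the same way.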
Emiel

spaCy has a Vectors class with a most_similar method. You can define a wrapper function like this to avoid writing your own implementation:

import spacy
import numpy as np

def most_similar(word, model, n=20):
    # note: loading the model on every call is slow; if you call this
    # repeatedly, load once and pass the nlp object in instead
    nlp = spacy.load(model)
    doc = nlp(word)
    vecs = [token.vector for token in doc]
    queries = np.array(vecs)
    keys_arr, best_rows_arr, scores_arr = nlp.vocab.vectors.most_similar(queries, n=n)
    keys = keys_arr[0]  # most_similar returns one row of keys per query; take the first
    similar_words_list = [nlp.vocab[key].text for key in keys]
    return similar_words_list

Call it like this: most_similar('apple', 'en_core_web_md', n=20). This finds the 20 words most similar to "apple" by cosine similarity, based on the spaCy model package "en_core_web_md".

This is the result: ['BLACKBERRY', 'APPLE', 'apples', 'PRUNES', 'iPHone', '3g/3gs', 'fruit', 'FIG', 'CREAMSICLE', 'iPad', 'ipad4', 'LONGAN', 'CALVADOS', 'iPOD', 'iPod', 'SORBET', 'PERSICA', 'peach', 'juice', 'JUICE']

Evan