
Assume you have a (Wikipedia) pre-trained word2vec model, and train it further on an additional corpus (very small, 1000 sentences).

Is there a way to limit a vector search to the vocabulary of the "re-trained" corpus only?

For example

model.wv.similar_by_vector() 

will simply find the closest word for a given vector, no matter whether it comes from the Wikipedia corpus or from the re-trained vocabulary.

On the other hand, for word-based search the concept already exists:

most_similar_to_given('house',['garden','boat'])

I have tried training on the small corpus from scratch, and it works roughly as expected. But it could of course be much more powerful if the assigned vectors came from a pre-trained set.

szeta

2 Answers


Sharing an efficient way to do this manually:

  1. Re-train word2vec on the additional corpus.
  2. Create a full unique word index of that corpus.
  3. Fetch the re-trained vector for each word in the index.
  4. Instead of the canned function similar_by_vector, use scipy.spatial.KDTree.query().

This finds the closest word within the given corpus only and works as expected.
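
A minimal sketch of those steps (assumptions: w2v_model is the re-trained gensim model, small_corpus is the tokenized additional corpus; the vectors are unit-normalized so KDTree's Euclidean nearest neighbours follow the same ordering as the cosine-based similar_by_vector):

import numpy as np
from scipy.spatial import KDTree

# Hypothetical inputs: w2v_model is the re-trained gensim Word2Vec model,
# small_corpus is a list of tokenized sentences from the additional corpus.
corpus_vocab = sorted({w for sent in small_corpus for w in sent
                       if w in w2v_model.wv})                 # step 2: unique word index
vectors = np.array([w2v_model.wv[w] for w in corpus_vocab])   # step 3: fetch re-trained vectors
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)     # unit length: Euclidean order ~ cosine order

tree = KDTree(vectors)                                        # step 4: KD-tree over corpus vectors only

def similar_by_vector_in_corpus(query_vector, topn=5):
    """Closest words to query_vector, restricted to the small corpus."""
    q = np.asarray(query_vector, dtype=float)
    q /= np.linalg.norm(q)
    dists, idxs = tree.query(q, k=topn)
    return [(corpus_vocab[i], float(d))
            for i, d in zip(np.atleast_1d(idxs), np.atleast_1d(dists))]

For example, similar_by_vector_in_corpus(w2v_model.wv['house']) would then return only words that occur in the small corpus.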

szeta

Similar to the approach for creating a subset of doc-vectors in a new KeyedVectors instance suggested here, assuming small_vocab is a list of the words in your new corpus, you could try:

from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(small_vocab, w2v_model.wv[small_vocab])

Then subset_vectors contains just the words you've selected, but supports familiar operations like most_similar().
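
A short usage sketch under the same assumptions (gensim 3.x API, w2v_model and small_vocab as above; 'house' is just a placeholder word present in small_vocab, and vector_size should match the pre-trained model):

vector_size = w2v_model.wv.vector_size                    # dimensionality of the pre-trained vectors
subset_vectors.most_similar('house', topn=5)              # candidates limited to small_vocab
subset_vectors.similar_by_vector(w2v_model.wv['house'])   # likewise restricted to the subset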

gojomo