
Assume you have a (Wikipedia) pre-trained word2vec model, and train it further on an additional corpus (very small, 1000 sentences).

Is there a way to limit a vector search to the vocabulary of the "re-trained" corpus only?

For example

model.wv.similar_by_vector() 

will simply find the closest word for a given vector, no matter whether it comes from the Wikipedia corpus or from the re-trained vocabulary.

On the other hand, for word-based search the concept already exists:

most_similar_to_given('house',['garden','boat'])

I have tried training on the small corpus from scratch, and it works roughly as expected. But it could of course be much more powerful if the assigned vectors came from a pre-trained set.

szeta

2 Answers


Sharing an efficient way to do this manually:

  1. Re-train word2vec on the additional corpus.
  2. Create a full unique word index of that corpus.
  3. Fetch the re-trained vector for each word in the index.
  4. Instead of the canned function similar_by_vector, use scipy.spatial.KDTree.query().

This finds the closest word within the given corpus only and works as expected.
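
A minimal sketch of those steps (assumptions: w2v_model is the re-trained gensim model, small_corpus is the tokenized additional corpus; the vectors are unit-normalized so KDTree's Euclidean nearest neighbours follow the same ordering as the cosine-based similar_by_vector):

import numpy as np
from scipy.spatial import KDTree

# Hypothetical inputs: w2v_model is the re-trained gensim Word2Vec model,
# small_corpus is a list of tokenized sentences from the additional corpus.
corpus_vocab = sorted({w for sent in small_corpus for w in sent
                       if w in w2v_model.wv})                 # step 2: unique word index
vectors = np.array([w2v_model.wv[w] for w in corpus_vocab])   # step 3: fetch re-trained vectors
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)     # unit length: Euclidean order ~ cosine order

tree = KDTree(vectors)                                        # step 4: KD-tree over corpus vectors only

def similar_by_vector_in_corpus(query_vector, topn=5):
    """Closest words to query_vector, restricted to the small corpus."""
    q = np.asarray(query_vector, dtype=float)
    q /= np.linalg.norm(q)
    dists, idxs = tree.query(q, k=topn)
    return [(corpus_vocab[i], float(d))
            for i, d in zip(np.atleast_1d(idxs), np.atleast_1d(dists))]

For example, similar_by_vector_in_corpus(w2v_model.wv['house']) would then return only words that occur in the small corpus.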

szeta

Similar to the approach for creating a subset of doc-vectors in a new KeyedVectors instance suggested here, assuming small_vocab is a list of the words in your new corpus, you could try:

from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(small_vocab, w2v_model.wv[small_vocab])

Then subset_vectors contains just the words you've selected, but supports familiar operations like most_similar().
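
A short usage sketch under the same assumptions (gensim 3.x API, w2v_model and small_vocab as above; 'house' is just a placeholder word present in small_vocab, and vector_size should match the pre-trained model):

vector_size = w2v_model.wv.vector_size                    # dimensionality of the pre-trained vectors
subset_vectors.most_similar('house', topn=5)              # candidates limited to small_vocab
subset_vectors.similar_by_vector(w2v_model.wv['house'])   # likewise restricted to the subset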

gojomo