By default, gensim `Word2Vec` only does vocabulary-discovery once. It happens when you supply a corpus (like your `sentences`) to the initial constructor, which does an automatic vocabulary-scan and training, or alternatively when you call `build_vocab()`. While you can continue to call `train()`, no new words will be recognized.
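For illustration, here is a minimal sketch of that one-shot flow, assuming gensim 4.x (where the dimensionality parameter is `vector_size`; older releases used `size`). The toy `sentences` corpus is hypothetical:

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus: a list of tokenized sentences
sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog", "sleeps"],
]

# One-shot: the constructor scans the vocabulary and trains
model = Word2Vec(sentences, vector_size=100, min_count=1)

# Equivalent explicit two-step form
model = Word2Vec(vector_size=100, min_count=1)
model.build_vocab(sentences)              # vocabulary-discovery happens here, once
model.train(
    sentences,
    total_examples=model.corpus_count,    # corpus size recorded by build_vocab()
    epochs=model.epochs,
)

# Later train() calls reuse the frozen vocabulary: words that were not
# seen during the scan are silently ignored, not added.
```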
There is support (which I would consider experimental) for calling `build_vocab()` with new text examples and an `update=True` parameter, to expand the vocabulary. While this lets further `train()` calls train both old and new words (see the sketch after this list), there are many caveats:
- such sequential training may not lead to models as good, or as self-consistent, as providing all examples interleaved. (For example, continued training may drift the words learned from later batches arbitrarily far from words/word-senses in earlier batches that are never re-presented.)
- such calls to `train()` should use one of the optional parameters (`total_examples` or `total_words`) to give an accurate count of the new batch's size, so that learning-rate decay and progress-logging are done properly (as shown in the sketch below)
- the core algorithm and its underlying theory aren't based on such batching, nor on multiple restarts of the learning-rate from high to low, so the interpretation of results (and the relative strength/balance of the resulting vectors) isn't as well-grounded
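As a rough sketch of that experimental path, continuing from the model above (again assuming gensim 4.x; `new_sentences` is a hypothetical second batch of tokenized texts):

```python
# Hypothetical second batch, including words never seen before
new_sentences = [
    ["an", "entirely", "novel", "sentence"],
    ["the", "fox", "returns"],
]

# Experimental: expand the existing vocabulary in place
model.build_vocab(new_sentences, update=True)

# Give an accurate size for THIS batch, so learning-rate decay and
# progress-logging are calibrated to it
model.train(
    new_sentences,
    total_examples=len(new_sentences),
    epochs=model.epochs,
)
```

Note that new words still have to meet the model's `min_count` threshold during the updated scan in order to be added.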
If at all possible, combine all your examples into one corpus, and do one large vocabulary-discovery followed by one training run.
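Reusing the hypothetical batches from above, that recommended path is just:

```python
# Preferred: one combined corpus, one vocabulary-scan, one training run
all_sentences = list(sentences) + list(new_sentences)
model = Word2Vec(all_sentences, vector_size=100, min_count=1)
```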