Can I build the vocabulary twice with gensim Word2Vec or Doc2Vec?

Question

I have two different corpora, and I want to train the model on both. I thought it could be done with something like this:

model.build_vocab(sentencesCorpus1)
model.build_vocab(sentencesCorpus2)

Would it be right?

score 0 · Accepted Answer · answered Feb 22 '18 at 18:08

No: each time you call build_vocab(corpus), like that, it creates a fresh vocabulary from scratch – discarding any prior vocabulary.

You can provide an optional argument to build_vocab(), update=True, which tries to add to the existing vocabulary. However:

It wasn't designed or tested with Doc2Vec in mind, and as of right now (February 2018), using it with Doc2Vec is unlikely to work and often causes memory-fault crashes. (See https://github.com/RaRe-Technologies/gensim/issues/1019.)
It's still best to train() with all available data together: making multiple calls to train(), with a different data subset each time, introduces other murky tradeoffs in model quality/correctness that are easy to get wrong. (And when calling train(), be sure to provide correct values for its required parameters; the practices shown in most online examples are typically only correct for the case where build_vocab() was called once, with exactly the same texts as the later call to train().)
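
As a rough illustration of the "all data together" advice, here is a minimal sketch that joins the two corpora into one restartable iterable, then builds the vocabulary once and trains once on the same sentences. The CombinedCorpus class, the train_on_all() helper, and the toy sentences are hypothetical names made up for this example. The gensim calls assume a release where the model exposes corpus_count and epochs; older versions may name things differently.

```python
from itertools import chain


class CombinedCorpus:
    """Restartable iterable that yields the sentences of several corpora in turn."""

    def __init__(self, *corpora):
        # Each corpus must itself be re-iterable (e.g. a list), not a one-shot generator.
        self.corpora = corpora

    def __iter__(self):
        # Return a fresh chain on every pass: build_vocab() and each training
        # epoch iterate over the corpus again from the start.
        return chain.from_iterable(self.corpora)


def train_on_all(sentences):
    """Build the vocabulary once and train once, both on the same sentences."""
    from gensim.models import Word2Vec  # assumption: gensim is installed

    model = Word2Vec(min_count=1)  # min_count=1 only because the toy data is tiny
    model.build_vocab(sentences)   # a single call, covering both corpora
    model.train(
        sentences,
        total_examples=model.corpus_count,  # set by build_vocab() above
        epochs=model.epochs,
    )
    return model


# Hypothetical toy data standing in for the question's two corpora.
sentencesCorpus1 = [["hello", "world"], ["machine", "learning"]]
sentencesCorpus2 = [["hello", "gensim"], ["word", "vectors"]]
all_sentences = CombinedCorpus(sentencesCorpus1, sentencesCorpus2)
```

Calling train_on_all(all_sentences) would then give one model whose vocabulary covers both corpora. Because build_vocab() and train() see exactly the same texts, passing model.corpus_count as total_examples is consistent here, which is the case the caveat above describes.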
