I have two different corpora, and I want to train the model on both. I thought it could be done like this:

model.build_vocab(sentencesCorpus1)
model.build_vocab(sentencesCorpus2)

Would it be right?

Mikel Laburu

1 Answer

No: each time you call build_vocab(corpus), like that, it creates a fresh vocabulary from scratch – discarding any prior vocabulary.

You can provide an optional argument to build_vocab(), update=True, which tries to add to the existing vocabulary. However:

  • it wasn't designed/tested with Doc2Vec in mind, and as of right now (February 2018), using it with Doc2Vec is unlikely to work and often causes memory-fault crashes. (See https://github.com/RaRe-Technologies/gensim/issues/1019.)

  • it's still best to train() with all available data together: multiple calls to train(), each with a different subset of the data, introduce other murky tradeoffs in model quality/correctness that are easy to get wrong. (And, when calling train(), be sure to provide correct values for its required parameters – the practices shown in most online examples are typically only correct for the case where build_vocab() was called once, with exactly the same texts as later calling train().)
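
A minimal sketch of the "all data together" approach, under some assumptions: CombinedCorpus and train_on_all are hypothetical helper names (not part of gensim), both corpora are restartable iterables such as lists rather than one-shot generators, and the model is an untrained gensim model (e.g. Word2Vec or Doc2Vec) from a version recent enough to expose model.corpus_count and model.epochs. The vocabulary is built once over both corpora, and train() then receives exactly the same texts:

```python
from itertools import chain


class CombinedCorpus:
    """Restartable iterable that yields every text of several corpora in turn."""

    def __init__(self, *corpora):
        self.corpora = corpora

    def __iter__(self):
        # Return a fresh chain on every call, so the model can make
        # several passes (one for the vocabulary scan, one per epoch).
        return chain.from_iterable(self.corpora)


def train_on_all(model, *corpora):
    """Hypothetical helper: one build_vocab() call, then train() on the same data."""
    combined = CombinedCorpus(*corpora)
    model.build_vocab(combined)  # single call that sees every text
    model.train(
        combined,
        total_examples=model.corpus_count,  # count recorded by build_vocab()
        epochs=model.epochs,                # value chosen when the model was created
    )
    return model
```

With the names from the question, the call would look like train_on_all(model, sentencesCorpus1, sentencesCorpus2). If the model is a Doc2Vec, the document tags should presumably be unique across both corpora, since texts that share a tag end up sharing one vector.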

gojomo