  1. Does build_vocab() extend my old vocabulary?

For example, my understanding is that when I train a model with Doc2Vec, it builds the vocabulary only from the dataset it is given, so if I want to extend that vocabulary, I need to use build_vocab().

  2. Where should I use it? Should I put it after "gensim.doc2vec()"?

For example:

sentences = gensim.models.doc2vec.TaggedLineDocument(f_path)
dm_model = gensim.models.doc2vec.Doc2Vec(sentences, dm=1, size=300, window=8, min_count=5, workers=4)
dm_model.build_vocab()
cody.tv.weber
Cherrymelon

1 Answer


You should follow working examples in gensim documentation/tutorials/notebooks or online tutorials to understand which steps are necessary and in what order.

In particular, if you provide your sentences corpus iterable on the Doc2Vec() initialization, it will automatically do both the vocabulary-discovery pass and all training – so you don’t then need to call either build_vocab() or train() yourself. And further, you would never call build_vocab() with no arguments. (No working example in docs or online will do what your code does – so don’t improvise new things until you’ve followed the examples and know why they do what they do.)

There is an optional update argument to build_vocab(), which purports to allow the expansion of a vocabulary from an earlier training session (in preparation for further training with the newer words). HOWEVER, it’s only been developed/tested with regard to Word2Vec models – there are reports it causes crashes when used with Doc2Vec. And even in Word2Vec, its overall effects and the best ways to use it aren’t clear across all training modes. So I don’t recommend its use except for experts who can read & interpret the source code, and weigh the many tradeoffs involved, on their own. If you receive a chunk of new texts with new words, the best-grounded course of action, and the easiest to evaluate and reason about, is to re-train from scratch using a combined corpus of all text examples.

gojomo
  • Is there any advantage to calling build_vocab separately rather than during Doc2Vec initialization, especially with regard to speed? – bendl Feb 25 '20 at 18:38
    If you supply a corpus iterable when you instantiate `Doc2Vec`, the initialization method will just call `build_vocab()` and then `train()` for you. (If you don't supply a corpus, it just skips that step & waits for you to call them.) So the total time required to instantiate/initialize/build-vocab/train is the same either way. Doing it explicitly just shows each step in your code more clearly, & gives you the option of either doing extra steps between each step, or varying the usual parameters. (Typically only more advanced tinkering requires that.) – gojomo Feb 25 '20 at 23:12