What are doc2vec training iterations?

Question

I am new to doc2vec. To understand it, I wrote the code below, which uses Gensim. As intended, I get a trained model and document vectors for the two documents.

However, I would like to know the benefits of training the model for several epochs, and how to do that in Gensim. Can we do it using the iter or alpha parameter, or do we have to train in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.

Also, I am interested in knowing whether multiple training iterations are needed for the word2vec model as well.

# Import libraries
from gensim.models import doc2vec
from collections import namedtuple

# Load data
doc1 = ["This is a sentence", "This is another sentence"]

# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors
model.docvecs[0]
model.docvecs[1]

score 7 · Accepted Answer · edited May 17 '18 at 13:23

Word2Vec and related algorithms (like 'Paragraph Vectors' aka Doc2Vec) usually make multiple training passes over the text corpus.

Gensim's Word2Vec/Doc2Vec allows the number of passes to be specified by the iter parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying docs to the Doc2Vec(docs, ...) constructor call.)

If unspecified, the default iter value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.

Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:

model = doc2vec.Doc2Vec(docs, iter=20, ...)

Because Doc2Vec often uses unique identifier tags for each document, more iterations can be more important, so that every doc-vector comes up for training multiple times over the course of the training, as the model gradually improves. On the other hand, because the words in a Word2Vec corpus might appear anywhere throughout the corpus, each word's associated vector will get multiple adjustments, early, middle, and late in the process as the model improves – even with just a single pass. (So with a giant, varied Word2Vec corpus, it's plausible to use fewer than the default number of passes.)
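
A rough way to see this difference is to record where in a single pass each vector would be touched. The sketch below is plain Python, not gensim code; the toy corpus and the helper name `touch_positions` are hypothetical, and it assumes one unique integer tag per document.

```python
from collections import defaultdict

# Toy corpus (hypothetical data): each document carries one unique tag.
corpus = [
    (["this", "is", "a", "sentence"], 0),
    (["this", "is", "another", "sentence"], 1),
    (["yet", "another", "sentence"], 2),
]

def touch_positions(corpus):
    """Map each word and each tag to the document positions where it occurs in one pass."""
    word_seen = defaultdict(list)
    tag_seen = defaultdict(list)
    for position, (words, tag) in enumerate(corpus):
        tag_seen[tag].append(position)
        for word in set(words):
            word_seen[word].append(position)
    return dict(word_seen), dict(tag_seen)

word_seen, tag_seen = touch_positions(corpus)
```

Here `word_seen["sentence"]` covers every position in the pass, while each entry of `tag_seen` holds exactly one position, so a doc-vector is only revisited on the next pass.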

You don't need to do your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of the easier step of supplying the docs corpus in the initializer call to trigger immediate training, then you must supply an epochs argument to train() – and it will perform that number of passes, so you still only need one call to train().

edited May 17 '18 at 13:23

Simon Hessner

answered Oct 18 '17 at 17:39

gojomo

Thanks a lot for your wonderful and superb answer :) If I am using CBOW word2vec, is it correct to use the same `iter` parameter to train multiple times? i.e. `model = word2vec.Word2Vec(sentences, sg=0, iter=10, ...)`? – Oct 18 '17 at 22:51

Yes, `Word2Vec` and `Doc2Vec` both support the `iter` parameter in their initialization-method! – gojomo Oct 19 '17 at 04:30

Thanks a lot :) Is there a difference between passing the documents directly to `doc2vec` and training as below? `model.build_vocab(sentences) for epoch in range(10): model.train(sentences)` Does it produce the same document vectors? – Oct 19 '17 at 07:44

Calling `train()` 10 times like that has many problems. If the call doesn't error for you, and you left the default `iter` at 5, each call does 5 passes – so you'll get 50 total passes over the data, not 10. Also, each call to `train()` glides the learning rate from the starting `alpha` to the `min_alpha`, so it will go high-low, high-low, etc – not at all correct for SGD. But also, because this error was common, latest gensim won't even let you call `train()` without an explicit `epochs` argument, so it'll error in latest gensim. Don't do it. – gojomo Oct 19 '17 at 18:11
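
To make the pass count and the high-low learning-rate pattern concrete, here is a minimal plain-Python sketch. It is not gensim's implementation: it assumes a simplified linear decay applied once per pass (the real rate is adjusted more finely within a pass), the helper `linear_schedule` is hypothetical, and the starting and ending rates are assumed illustrative values.

```python
def linear_schedule(start_alpha, min_alpha, passes):
    """Learning rate at the start of each pass, decaying linearly toward min_alpha."""
    step = (start_alpha - min_alpha) / passes
    return [start_alpha - step * p for p in range(passes)]

START_ALPHA, MIN_ALPHA = 0.025, 0.0001  # assumed illustrative values
PASSES_PER_CALL = 5                     # the default `iter` described above

# One train() call covering 20 passes: a single smooth decay.
single_call = linear_schedule(START_ALPHA, MIN_ALPHA, 20)

# Ten separate train() calls: every call restarts at START_ALPHA.
looped = []
for _ in range(10):
    looped.extend(linear_schedule(START_ALPHA, MIN_ALPHA, PASSES_PER_CALL))

total_passes_looped = len(looped)  # 10 calls x 5 passes each

# Count how often the rate jumps back up between consecutive passes.
resets = sum(1 for a, b in zip(looped, looped[1:]) if b > a)
```

Under these assumptions the looped version makes 50 passes and the rate jumps back up at each of the 9 boundaries between calls, whereas the single call decreases steadily.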

Thanks a lot for the perfect answer. So, what you recommend is to use `model = doc2vec.Doc2Vec(docs, iter=20, ...)` instead of `model.build_vocab(sentences) for epoch in range(10): model.train(sentences)`, right? :) – Oct 19 '17 at 22:35

Yes, because the latter won't do what you intend, or manage the learning-rate right, or even run without an error (in recent gensim versions). – gojomo Oct 19 '17 at 23:49
In which cases would it be better to not specify docs in the constructor but call train() once with the correct number of epochs? – Simon Hessner May 17 '18 at 12:11

Supplying the docs in the constructor causes `build_vocab()` and `train()` to run automatically, so if you just want them run in the usual way, that's an OK way to use the model. If you wanted to time the steps separately, or do some analysis/logging after the vocabulary is discovered, or possibly even split up the `build_vocab()` into its constituent steps (and do something like iteratively try multiple `min_count` values to `scale_vocab()` to hit a certain surviving-vocabulary size or model RAM size), then you'd want to avoid the automatic calls, and do them explicitly separately. – gojomo May 17 '18 at 16:44
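
The idea of trying several `min_count` values until the surviving vocabulary fits a budget can be sketched in plain Python. This is only a stand-in for the concept and does not call gensim's `scale_vocab()`; the helper names `surviving_vocab_size` and `smallest_min_count_for`, and the toy texts, are hypothetical.

```python
from collections import Counter

def surviving_vocab_size(counts, min_count):
    """Number of distinct words that occur at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)

def smallest_min_count_for(counts, max_vocab):
    """Smallest min_count whose surviving vocabulary has at most max_vocab words."""
    min_count = 1
    # Terminates: once min_count exceeds the highest frequency, nothing survives.
    while surviving_vocab_size(counts, min_count) > max_vocab:
        min_count += 1
    return min_count

# Toy word-frequency survey (hypothetical data).
texts = ["this is a sentence", "this is another sentence", "yet another sentence"]
counts = Counter(word for text in texts for word in text.lower().split())

chosen = smallest_min_count_for(counts, max_vocab=4)
```

With a real model, the chosen value would then be applied in the explicit vocabulary-building step before a single `train()` call.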
