
I am pretty new to doc2vec, so I did some small research and found a couple of things. Here is my story: I am trying to train doc2vec on 2.4 million documents. At first, I tried doing so with a small model of only 12 documents. I checked the results by inferring a vector for the first document, and found it was indeed similar to the first document, with a cosine similarity of 0.97-0.99. That seemed good, even though when I tried a new document with completely different words, I still received a high similarity score of 0.8.

However, I put that aside and went on to build the full model with the 2.4 million documents. At this point, my problems began. The results were complete nonsense: most_similar returned results with a similarity of 0.4-0.5 that were completely different from the new document being checked. I tried to tune the parameters, but with no results yet. I also tried to remove randomness from both the small and the big model, but I still got different vectors.

Then I tried to use get_latest_training_loss on each epoch, in order to see how the loss changes from epoch to epoch. This is my code:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025,
                pretrained_emb=".../glove.840B.300D/glove.840B.300d.txt",
                seed=1, workers=1, compute_loss=True)
model.build_vocab(documents)

for epoch in range(10):
    model.train(documents, total_examples=token_count, epochs=1)
    training_loss = model.get_latest_training_loss()
    print("Training Loss: " + str(training_loss))

    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

I know this code is a bit awkward, but it is used here only to follow the loss. The error I receive is:

AttributeError: 'Doc2Vec' object has no attribute 'get_latest_training_loss'

I looked at the auto-complete suggestions for model. and found that there is indeed no such function. I did find a similarly named attribute, training_loss, but it gives me the same error.

Can anyone here give me an idea?

Thanks in advance.

Eli Borodach

1 Answer


Especially as a beginner, there's no pressing need to monitor training-loss. For a long time, gensim didn't report it in any way for any models – and it was still possible to evaluate & tune models.

Even now, running-loss reporting in gensim is a somewhat rough, incomplete, advanced/experimental feature, and after a recent refactoring it doesn't seem to have full support in Doc2Vec. (Notably, while having the loss level reach a plateau can be a helpful indicator that further training can't help, it is most definitely not the case that a model with arbitrarily lower loss is better than others. In particular, a model that achieves near-zero loss would likely be extremely overfit, and probably of little use for downstream applications.)
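For reference, this is roughly how the loss-tallying feature looks on Word2Vec, where it is supported in recent gensim releases. This is a minimal sketch, and the toy sentences corpus is purely illustrative:

from gensim.models import Word2Vec

# Purely illustrative toy corpus; substitute your own tokenized texts.
sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]] * 100

# compute_loss=True asks gensim to keep a running loss tally during training.
w2v = Word2Vec(sentences, vector_size=100, epochs=5, compute_loss=True)

# Note: the tally is cumulative across train() calls, one of the rough
# edges mentioned above.
print(w2v.get_latest_training_loss())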

Regarding your general aim of getting good vectors, here are some notes on the process you've described/shown:

  • Tiny tests (as with your 12 documents) don't really work with these algorithms, except to check that you're calling the steps with legal parameters. You shouldn't expect the similarities in such toy-sized tests to mean anything, even if they superficially meet expectations in some cases. The algorithms need lots of training data & large vocabularies to train sensible models. (So, your full 2.4 million docs should work well.)

  • You generally shouldn't be changing the default alpha/min_alpha values, or calling train() multiple times in a loop. You can just leave those at their defaults and call train() once with your desired number of training epochs, and it will do the right thing (see the first sketch after this list). The approach in your shown code is a suboptimal, fragile anti-pattern; whichever online source you learned it from is misguided and severely outdated.

  • You haven't shown your inference code, but note that it will re-use the epochs, alpha, and min_alpha cached in the model instance from its original initialization, unless you supply other values. And the default epochs, if not specified, is a value of just 5, inherited from code shared with Word2Vec. Doing a mere 5 epochs, and leaving the effective alpha at 0.025 the whole time (which is what alpha=0.025, min_alpha=0.025 does to inference), is unlikely to give good results, especially on short docs. Common epochs values in published work are 10-20, and doing at least as many epochs for inference as were used for training is typical (see the second sketch after this list).

  • You are showing the use of a pretrained_emb initialization parameter that is not part of the standard gensim library, so perhaps you're using some other fork, based on some older version of gensim. Note that it's not typical to initialize a Doc2Vec model with word-embeddings from elsewhere before training, so if you're doing that, you're already in advanced/experimental territory, which is premature if you're still trying to get basic doc-vectors into reasonable shape. (And usually people seek tricks like re-used word-vectors when they have a small corpus. With 2.4 million docs, you probably don't have such corpus problems; any word-vectors can be learned from your corpus along with doc-vectors, in the default way.)
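To make the second bullet concrete, here's a minimal sketch of the plain, recommended training pattern, using gensim-4.x-style names and assuming documents is your iterable of TaggedDocument objects:

from gensim.models.doc2vec import Doc2Vec

# Leave alpha/min_alpha at their defaults; pass the full epoch count once,
# and train() will handle the learning-rate decay internally.
model = Doc2Vec(vector_size=300, epochs=20, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)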
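And for the third bullet, a sketch of inference with explicit parameters rather than relying on the values cached at initialization. The example tokens are hypothetical, and model.dv is the gensim-4.x name for the trained doc-vectors:

# Pass epochs explicitly so inference gets enough passes, regardless of
# whatever alpha/min_alpha/epochs were cached in the model.
tokens = "words of some new document to check".split()
vector = model.infer_vector(tokens, epochs=20)
print(model.dv.most_similar([vector], topn=10))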

gojomo
  • I tried to implement your suggestions; no improvement was shown after 10 or 20 epochs of training (I also removed alpha, min_alpha, and the pre-trained vectors). As for infer_vector, I called it without any parameters, and still the results are really, really poor. Can you give me an idea of how I can cope with this situation? – Eli Borodach Aug 18 '19 at 07:43
  • What showed "no improvement"? (What are the exact results you're getting, and what is your basis for judging them poor, compared to what you expect?) If your code has changed significantly from what's shown in the original question, you should update or extend your question to show what you're doing now, so it will be clear if there are remaining problems. Also, as a general tip: be sure to run with logging on at least the INFO level and closely review the output (perhaps sharing it in your question) to see if all the proper progress/interim tallies make sense. – gojomo Aug 18 '19 at 12:50
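For reference, a minimal sketch of enabling the INFO-level logging suggested in the last comment:

import logging

# gensim reports progress and interim tallies via Python's standard logging;
# INFO level is enough to review training progress.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)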