
I want to train a Doc2Vec model with a generic corpus and then continue training with a domain-specific corpus (I have read that this is a common strategy, and I want to test the results).

I have all the documents up front, so I can build the full vocabulary and tag the documents at the beginning.

As I understand it, I should first train all the epochs with the generic docs, and then repeat the epochs with the ad hoc docs. But this way, I cannot place all the docs in a single corpus iterator and call train() once (as is recommended everywhere).

So, after building the global vocab, I have created two iterators, the first one for the generic docs and the second one for the ad hoc docs, and called train() twice.
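In code, it looks roughly like this (a minimal sketch, with plain in-memory lists of TaggedDocument objects standing in for my two iterators):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# generic_docs and domain_docs: lists of TaggedDocument objects,
# with tags that are unique across both corpora.
model = Doc2Vec(vector_size=100, min_count=2, epochs=20)

# Build the vocabulary once, over the union of both corpora.
model.build_vocab(generic_docs + domain_docs)

# First pass: all epochs over the generic corpus.
model.train(generic_docs, total_examples=len(generic_docs),
            epochs=model.epochs)

# Second pass: all epochs again, over the ad hoc (domain) corpus.
# (Whether/how to manage alpha here is exactly my question below.)
model.train(domain_docs, total_examples=len(domain_docs),
            epochs=model.epochs)
```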

Is this the best way, or is there a more appropriate way?

If it is the best way, how should I manage alpha and min_alpha? Is it a good decision not to specify them in the train() calls and to let train() manage them?

Best

Alberto

1 Answer


This is probably not a wise strategy, because:

  • the Python Gensim Doc2Vec class has never properly supported expanding its known vocabulary after the first build_vocab() call. (Up through at least gensim 3.8.3, such attempts typically cause a segmentation-fault process crash.) Thus, if there are words that appear only in your domain corpus, an initial typical initialization/training on the generic corpus would leave them out of the model entirely. (You could work around this with some atypical extra steps, but the other concerns below would remain.)

  • if there is truly an important contrast between the words/word-senses used in your generic corpus and those used in your domain corpus, the influence of the generic corpus's words may not be beneficial, diluting domain-relevant meanings.

  • further, any followup training that just uses a subset of all documents (the domain corpus) will only be updating the vectors for that subset of words/word-senses, and the model's internal weights used for further unseen-document inference, in directions that make sense for the domain-corpus alone. Such later-trained vectors may be nudged arbitrarily far out of comparable alignment with other words not appearing in the domain-corpus, and earlier-trained vectors will find themselves no longer tuned in relation to the model's later-updated internal-weights. (Exactly how far will depend on the learning-rate alpha & epochs choices in the followup training, and how well that followup training optimizes model loss.)

If your domain dataset is sufficient, or can be grown with more domain data, it may not be necessary to mix in other training steps/data. But if you think you must try that, the best-grounded approach would be to shuffle all training data together, and train in one session where all words are known from the beginning, and all training examples are presented in balanced, interleaved fashion. (Or possibly, where some training texts considered extra-important are oversampled, but still mixed in with the variety of all available documents, in all epochs.)
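As a minimal sketch of that combined approach (assuming both corpora fit in memory as lists of TaggedDocument objects; the names are illustrative):

```python
import random
from gensim.models.doc2vec import Doc2Vec

# Hypothetical in-memory lists of TaggedDocument objects.
combined = list(generic_docs) + list(domain_docs)
random.shuffle(combined)  # interleave generic & domain examples

model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
model.build_vocab(combined)  # every word known from the start
model.train(combined, total_examples=model.corpus_count,
            epochs=model.epochs)  # one session, default alpha decay
```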

If you see an authoritative source suggesting such a "train with one dataset, then another disjoint dataset" approach with the Doc2Vec algorithms, you should press them for more details on what they did to make that work: exact code steps, and the evaluations which showed an improvement. (It's not impossible that there's some way to manage all the issues! But I've seen many vague impressions that this separate-pretraining is straightforward or beneficial, and zero actual working writeups with code and evaluation metrics showing that it's working.)

Update with respect to the additional clarifications you provided at https://stackoverflow.com/a/64865886/130288:

Even with that context, my recommendation remains: don't do this segmenting of training into two batches. It's almost certain to degrade the model compared to a combined training.

I would be interested to see links to the "references in the literature" you allude to. They may be confused or talking about algorithms other than the Doc2Vec ("Paragraph Vectors") algorithm.

If there is any reason to give your domain docs more weight, a better-grounded way would be to oversample them in the combined corpus.
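For example, a rough sketch of that oversampling (the factor here is a hypothetical knob to tune experimentally):

```python
import random

OVERSAMPLE = 3  # hypothetical weighting factor for the domain docs
combined = list(generic_docs) + list(domain_docs) * OVERSAMPLE
random.shuffle(combined)  # keep domain docs interleaved, not batched at the end
# ...then build_vocab() & train() once on `combined`, as above.
```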

But by all means, test all these variants & publish the relative results. If you're exploring shaky hypotheses, I would ignore any advice from StackOverflow-like sources & just run all the variants that your reading of the literature suggests, to see which, if any, actually help.

You're right to recognize that the choice of alpha parameters is a murky area that could majorly influence what impact such add-on training has. There's no right answer, so you'll have to search for, and reason out, what might make sense. The inherent issues I've mentioned with such subset-followup-training could make it so that even if you find benefits in some combos, they may be more a product of a lucky combination of data & arbitrary parameters than a generalizable practice.

And: your specific question "if it is better to set such values or not provide them at all" reduces to: "do you want to use the default values, or values set when the model was created, or not?"

Which values might be workable, if at all, for this unproven technique is something that'd need to be experimentally discovered. That is, if you wanted to have comparable (or publishable) results here, I think you'd have to justify from your own novel work some specific strategy for choosing good alpha/epochs and other parameters, rather than adopt any practice merely recommended in a StackOverflow answer.
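For reference, here's what the two choices look like in Gensim; the explicit values shown are placeholders, not recommendations:

```python
# Option 1: pass no alpha arguments, so train() sweeps the learning rate
# from model.alpha down to model.min_alpha (class defaults: 0.025 -> 0.0001).
# Note that each separate train() call restarts that sweep from the top.
model.train(domain_docs, total_examples=len(domain_docs),
            epochs=model.epochs)

# Option 2: override the sweep for this call only, via start_alpha/end_alpha.
model.train(domain_docs, total_examples=len(domain_docs),
            epochs=model.epochs,
            start_alpha=0.005, end_alpha=0.0001)  # placeholder values
```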

gojomo
  • You can take a look at the following paper from Microsoft researchers: "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing". There, they say that in some circumstances (little in-domain data), the hybrid approach can lead to good results. But, of course, the goal of the paper is to demonstrate that with enough in-domain data, direct training with an in-domain corpus is better. It must also be said that they use BERT in their experiments, not Doc2Vec. – Alberto Gil Solla Nov 17 '20 at 08:46
  • I really share their opinion, but in any case, my interest is in testing different scenarios myself to observe the results with my data. Of course, I intend to test different combinations, but I am not an expert in Doc2Vec (or even in machine learning generally), and any comment from people with much more experience and knowledge is welcome, to avoid naive errors. Thanks again for your time. Alberto – Alberto Gil Solla Nov 17 '20 at 08:48
  • BERT is different & there's a bunch of work demonstrating the exact steps & results of 'fine-tuning'. I've not seen such writeups for 'Paragraph Vectors' (`Doc2Vec`), & I wouldn't be confident that training a `Doc2Vec` model on two disjoint batches of data would necessarily be analogous to BERT's pre-training & fine-tuning steps. – gojomo Nov 17 '20 at 17:34