
I would like to train a word2vec model on an unordered list of keywords and categories for each document. As a result, my vocabulary is quite small, around 2.5k tokens.

Would the performance be improved if, at the training step, I used actual sentences from the documents?

For example:

doc_keywords = ['beach', 'holiday', 'warm']
doc_body = 'Going on a beach holiday it can be very warm'

If there is a benefit to using the full documents, could someone also explain why this is the case?

Since the model predicts the next word in a document, what would be the benefit of it learning very -> warm as two words which often come together, given that very is not in my vocabulary?
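
For concreteness, the two training setups I have in mind would look roughly like this (a minimal gensim sketch; the corpora and parameter values are made-up placeholders, and the gensim 4.x vector_size name is assumed, it is size in gensim 3.x):

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Made-up stand-ins for the real per-document data.
keyword_corpus = [
    ['beach', 'holiday', 'warm'],
    ['city', 'break', 'museum'],
]
fulltext_corpus = [
    simple_preprocess('Going on a beach holiday it can be very warm'),
    simple_preprocess('A city break with a museum visit can be great'),
]

# Setup 1: train only on the unordered keyword lists.
model_keywords = Word2Vec(keyword_corpus, vector_size=100, window=5, min_count=1)
# Setup 2: train on tokenized full sentences from the documents.
model_fulltext = Word2Vec(fulltext_corpus, vector_size=100, window=5, min_count=1)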

dendog

2 Answers


My notes can be summarized in the following points:

  • First of all, I don't think passing a list of keywords would be of any help to the gensim.models.Word2Vec model. As you said, the reason behind using word2vec is to somehow get a feel for the surrounding words; how could it do that with a random list of keywords?

  • Second of all, the vocabulary should consist of the same words that appear in the documents. So, your vocabulary should have very in it.

  • The more data you use, the more useful the model becomes. So, 2,500 tokens isn't a big vocabulary. For example, the first version of word2vec, the Skip-gram model, was published by Google in 2013. The vocabulary that Google used was about 692,000 words.

  • There are two versions of word2vec that can be used: "Skip-gram" and "Continuous Bag of Words (CBOW)". Both depend on the surrounding words. You can check my answer here for more information on how each one of them works; a minimal sketch of selecting either one in gensim follows this list.
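
In gensim, the choice between the two is controlled by the sg flag (a rough sketch; the sample sentence is made up, and the gensim 4.x vector_size name is assumed):

from gensim.models import Word2Vec

# A tokenized sample document (made up).
sentences = [['going', 'on', 'a', 'beach', 'holiday', 'it', 'can', 'be', 'very', 'warm']]

# CBOW (sg=0, the default): predict the centre word from its surrounding context words.
cbow_model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)
# Skip-gram (sg=1): predict the surrounding context words from the centre word.
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)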

Anwarvic

Your dataset seems quite small – perhaps too small to expect good word2vec vectors. But, a small dataset at least means it shouldn't take too much time to try things in many different ways.

So, the best answer (and the only one that truly takes into account whatever uniqueness might be in your data & project goals) comes from trying both: do you get better final word-vectors, for your project-specific needs, when training on just the keywords or on the longer documents?
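
For example, one quick (if rough) way to compare is to train both variants and eyeball the nearest neighbours of a few words that matter to your project. A sketch, where the tiny corpora are stand-ins for your real tokenized data and the parameter values are arbitrary (raise min_count on real data):

from gensim.models import Word2Vec

keyword_corpus = [['beach', 'holiday', 'warm'], ['city', 'museum', 'break']]
fulltext_corpus = [
    ['going', 'on', 'a', 'beach', 'holiday', 'it', 'can', 'be', 'very', 'warm'],
    ['a', 'city', 'break', 'with', 'a', 'museum', 'visit', 'can', 'be', 'great'],
]

model_kw = Word2Vec(keyword_corpus, vector_size=100, window=5, min_count=1)
model_ft = Word2Vec(fulltext_corpus, vector_size=100, window=5, min_count=1)

# Compare nearest neighbours for a few probe words and judge which list
# looks more useful for the downstream task.
for probe in ['beach', 'holiday', 'warm']:
    if probe in model_kw.wv and probe in model_ft.wv:
        print(probe)
        print('  keywords :', model_kw.wv.most_similar(probe, topn=5))
        print('  full text:', model_ft.wv.most_similar(probe, topn=5))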

Two potential sources of advantage from using the full texts:

  • Those less-interesting words might still help tease-out subtleties of meaning in the full vector space. For example, a contrast between 'warm' and 'hot' might become clearer when those words are forced to predict other related words that co-occur with each in different proportions. (But, such qualities of word2vec vectors require lots of subtly-varied real usage examples – so such a benefit might not be possible in a small dataset.)

  • Using the real texts preserves the original proximity-influences – words nearer each other have more influence. The keywords-only approach might be scrambling those original proximities, depending on how you're turning raw full texts into your reduced keywords. (In particular, you definitely do not want to always report keywords in some database-sort order – as that would tend to create a spurious influence between keywords that happen to sort next-to each other, as opposed to appear next-to each other in natural language.)
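
If you do stay with keywords-only training, one cheap guard against that sort-order artifact is to shuffle each document's keyword list (optionally re-shuffling between epochs) before handing it to Word2Vec. A rough sketch, where keyword_lists stands in for your per-document keywords:

import random

# Per-document keyword lists, possibly in some database sort order (made up).
keyword_lists = [
    ['beach', 'holiday', 'warm'],
    ['break', 'city', 'museum'],
]

# Shuffle each list so that alphabetical or ID ordering doesn't masquerade
# as a meaningful word-to-word proximity during training.
shuffled_corpus = []
for keywords in keyword_lists:
    shuffled = list(keywords)
    random.shuffle(shuffled)
    shuffled_corpus.append(shuffled)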

On the other hand, including more words makes the model larger & the training slower, which might limit the amount of training or experiments you can run. And, keeping very-rare words – that don't have enough varied usage examples to get good word-vectors themselves – tends to act like 'noise' that dilutes the quality of other word-vectors. (That's why dropping rare words, with a min_count similar to its default of 5 – or larger in larger corpuses – is almost always a good idea.)
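
A quick frequency survey can show how much a given min_count would discard before you commit to it (a sketch; corpus stands in for your tokenized documents):

from collections import Counter

# corpus stands in for your tokenized documents (keywords or full texts).
corpus = [
    ['beach', 'holiday', 'warm'],
    ['going', 'on', 'a', 'beach', 'holiday', 'it', 'can', 'be', 'very', 'warm'],
]

freqs = Counter(word for doc in corpus for word in doc)
rare = [word for word, count in freqs.items() if count < 5]
print(f'{len(rare)} of {len(freqs)} distinct words appear fewer than 5 times')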

So, there's no sure answer for which will be better: different factors, and other data/parameter/goals choices, will pull different ways. You'll want to try it in multiple ways.

gojomo
  • Thanks! The `build_vocab` step is broken out, so what happens in model training when I build a vocab with, say, just my keywords but then train on full texts? How does the model handle the out-of-vocabulary words? – dendog Apr 28 '20 at 16:56
  • If a word wasn't added to the tracked-vocabulary during the initial survey (`build_vocab()`), it will be completely ignored during training – essentially stripped before the innermost training loops happen. – gojomo Apr 28 '20 at 18:58
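
A minimal sketch of the `build_vocab()` / `train()` split being discussed here (gensim 4.x argument names assumed; the corpora are made up):

from gensim.models import Word2Vec

keyword_corpus = [['beach', 'holiday', 'warm']]
fulltext_corpus = [['going', 'on', 'a', 'beach', 'holiday', 'it', 'can', 'be', 'very', 'warm']]

model = Word2Vec(vector_size=100, window=5, min_count=1)

# The vocabulary survey only sees the keywords...
model.build_vocab(keyword_corpus)

# ...so words like 'very' that appear only in the full texts are silently
# ignored during training; only in-vocabulary words contribute to the updates.
model.train(fulltext_corpus, total_examples=len(fulltext_corpus), epochs=model.epochs)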