
I am using the gensim Word2Vec model to train word embeddings. My code is:

import multiprocessing

from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()

# `sentences` is my iterable of tokenized texts (a list of lists of str)
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=50,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=cores - 1)

w2v_model.build_vocab(sentences, progress_per=10000)

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=50, report_delay=1)

I wonder whether I can access the negative and positive word samples during the training process.

Thanks in advance.

1 Answer


Deep inside the training loops, for each individual 'center' word in the training texts that is to be predicted – a micro-training-example for the shallow neural-net – a different set of negative words will be chosen.

Those negative-words will be used for just that one set of forward/backward neural-net nudges, then discarded when training moves to the next word.
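As a toy schematic of that flow (plain Python for illustration – not gensim's actual Cython loop, and every name here is made up):

import random

vocab = ['the', 'cat', 'sat', 'on', 'mat']        # hypothetical toy vocabulary
positive_pairs = [('cat', 'sat'), ('sat', 'on')]  # hypothetical micro-examples

for center, context in positive_pairs:
    negatives = random.choices(vocab, k=5)  # fresh random draw per positive pair
    print(center, context, '->', negatives)
    # ...one set of forward/backward nudges would happen here, then the
    # `negatives` list is simply discarded – the model never stores it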

There's no way to access them other than changing that core code – which is actually written in Cython, & re-compiled into a native library after any changes. (It's a bit harder to tinker with than pure Python code.)

You can see where the exact choice of negative-samples happens in the source code for one of the modes (CBOW with negative-sampling) here:

https://github.com/RaRe-Technologies/gensim/blob/91175ddc7e3d6f3a2af245c20af21ec3bf5e360f/gensim/models/word2vec_inner.pyx#L427

If you just need a representative set of negative-words, you could copy these steps in your own code.
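For instance, here's a rough sketch – assuming gensim 4.x, where a built model exposes its internal cum_table, the cumulative-frequency table (unigram counts raised to the 0.75 ns_exponent power) that the Cython code binary-searches with a random integer:

import numpy as np

# Sketch only, not an official gensim API – internal attributes like
# `cum_table` may change across versions.
def sample_negatives(model, k=20, rng=None):
    rng = rng or np.random.default_rng()
    cum_table = model.cum_table
    draws = rng.integers(0, cum_table[-1], size=k)            # random ints below the table's top value
    indexes = np.searchsorted(cum_table, draws, side='left')  # ~ the Cython bisect_left
    return [model.wv.index_to_key[int(i)] for i in indexes]

print(sample_negatives(w2v_model, k=20))  # e.g. 20 representative negative words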

If you want to know (& potentially log?) the negative words chosen for every positive prediction, I suspect that's a misguided idea:

  • Meaningful analysis of this algorithm's behavior won't depend on individual micro-examples or on the arbitrarily-random negative words chosen over all of training. The interesting properties only arise from the tug-of-war happening across the interplay of all training.
  • As this is very deep in the training loops, even the most-efficient extra steps, as a function of the negative-words, would slow things down a lot. Or, in the case of logging, result in 20x (for negative=20) more logged negative-words than your original training corpus. For the kinds of large corpora where this algorithm works well, such a slowdown/log could be onerous; for tiny toy-sized examples, this algorithm won't be working interestingly at all.

So the mere desire to peek at all the (random, arbitrary) negative words during the process suggests you may be going down a questionable path.

It'd be easier for me to imagine just wanting to see a representative set of the negatively-sampled words – because any 10, or 10,000, or 1,000,000 such randomly-chosen words are as good as any other, and the algorithm (on adequately-sized data) is robust against the usual variance in which negative-words are actually chosen. And for that, you could just run the same sampling process outside of training, as in the sketch above.

Separately: those are odd non-default choices for alpha & min_alpha – values that usually don't need any tweaking, and that should only be tweaked with a conscious plan, driven by quantitative evaluations comparing the results of alternate values. But those specific odd, unmotivated values are pretty common in some of the worst online tutorials. So beware where you're learning about word2vec!
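For reference, a minimal sketch of the same setup that simply leaves the learning-rate schedule at gensim's defaults (alpha=0.025, min_alpha=0.0001 as of gensim 4.x):

# Same setup as in the question, minus the unmotivated alpha/min_alpha overrides
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=50,
                     sample=6e-5,
                     negative=20,
                     workers=cores - 1)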

gojomo