The key line is:
https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec_inner.pyx#L543
```cython
if c.sample and word.sample_int < random_int32(&c.next_random):
    continue
```
The `if c.sample` part tests whether frequent-word downsampling is enabled at all (any non-zero value).
The `word.sample_int` is a per-vocabulary-word value that was precalculated during the vocabulary-discovery phase. It's essentially the 0.0-to-1.0 probability that the word should be kept, scaled to the range 0-to-(2^32-1).
Most words, which are never down-sampled, simply have the value (2^32-1) there - so no matter what random int was just generated, it can't exceed that threshold, and the word is retained.
The few most-frequent words have other, smaller scaled values there, and thus sometimes the random int generated is larger than their `sample_int`. Thus, that word is, in that one training-cycle, skipped via the `continue` to the next word in the sentence. (That one word doesn't get made part of `effective_words`, this one time.)
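
To see the effect of those thresholds, here's a minimal pure-Python sketch of the same keep-or-skip decision. The `is_kept` helper and the 0.25 probability for a hypothetical very-frequent word are made up for illustration; the real values and RNG live in the Cython code linked above:

```python
import random

UINT32_MAX = 2**32 - 1

def is_kept(sample_int):
    """Mimic the Cython test: the word is skipped when sample_int < a random 32-bit int."""
    rand = random.getrandbits(32)      # stand-in for random_int32(&c.next_random)
    return not (sample_int < rand)

# A word that is never down-sampled: its threshold is the max 32-bit value,
# so no random draw can exceed it and the word is always kept.
rare_word_sample_int = UINT32_MAX

# A hypothetical very-frequent word: kept only about 25% of the time.
frequent_word_sample_int = int(0.25 * 2**32)

trials = 100_000
print(sum(is_kept(rare_word_sample_int) for _ in range(trials)) / trials)      # -> 1.0
print(sum(is_kept(frequent_word_sample_int) for _ in range(trials)) / trials)  # -> ~0.25
```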
You can see the original assignment & precalculation of the `.sample_int` values, per unique vocabulary word, at and around:
https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec.py#L1544
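
As a rough illustration of that precalculation, here's a pure-Python sketch assuming the usual word2vec downsampling formula. The function name `sample_ints` and the tiny example counts are made up for this example; the exact handling of thresholds and retained counts is in the linked gensim source:

```python
from math import sqrt

def sample_ints(word_counts, sample=1e-3):
    """Precompute a keep-threshold in 0..(2**32 - 1) per word from raw corpus counts.

    Assumes the standard word2vec downsampling formula; `sample` mirrors the
    Word2Vec(sample=...) parameter.
    """
    total = sum(word_counts.values())
    threshold_count = sample * total
    result = {}
    for word, count in word_counts.items():
        # 0.0-to-1.0 probability of keeping each occurrence of this word.
        keep_prob = (sqrt(count / threshold_count) + 1) * (threshold_count / count)
        # Words below the frequency threshold cap out at 1.0, i.e. the max
        # 32-bit value described above, so they are never skipped.
        result[word] = int(round(min(keep_prob, 1.0) * (2**32 - 1)))
    return result

counts = {'the': 500_000, 'aardvark': 3}
print(sample_ints(counts))   # 'the' gets a small threshold, 'aardvark' the max
```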