The key line is:
https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec_inner.pyx#L543
```cython
if c.sample and word.sample_int < random_int32(&c.next_random):
    continue
```
The `if c.sample` part tests whether frequent-word downsampling is enabled at all (any non-zero value).
The `word.sample_int` is a per-vocabulary-word value that was precalculated during the vocabulary-discovery phase. It's essentially the 0.0-to-1.0 probability that the word should be kept, scaled to the range 0-to-(2^32-1).
Most words, which are never down-sampled, simply have the value (2^32-1) there - so no matter what random int was just generated, it can't exceed that threshold, and the word is retained.
The few most-frequent words have other, smaller scaled values there, and thus sometimes the random int generated is larger than their `sample_int`. Thus, that word is, in that one training-cycle, skipped via the `continue` to the next word in the sentence. (That one word doesn't get made part of `effective_words`, this one time.)
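
To see the effect of those thresholds, here's a minimal pure-Python sketch of the same keep-or-skip decision. The `is_kept` helper and the 0.25 probability for a hypothetical very-frequent word are made up for illustration; the real values and RNG live in the Cython code linked above:

```python
import random

UINT32_MAX = 2**32 - 1

def is_kept(sample_int):
    """Mimic the Cython test: the word is skipped when sample_int < a random 32-bit int."""
    rand = random.getrandbits(32)      # stand-in for random_int32(&c.next_random)
    return not (sample_int < rand)

# A word that is never down-sampled: its threshold is the max 32-bit value,
# so no random draw can exceed it and the word is always kept.
rare_word_sample_int = UINT32_MAX

# A hypothetical very-frequent word: kept only about 25% of the time.
frequent_word_sample_int = int(0.25 * 2**32)

trials = 100_000
print(sum(is_kept(rare_word_sample_int) for _ in range(trials)) / trials)      # -> 1.0
print(sum(is_kept(frequent_word_sample_int) for _ in range(trials)) / trials)  # -> ~0.25
```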
You can see the original assignment & precalculation of the `.sample_int` values, per unique vocabulary word, at and around:
https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec.py#L1544
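
As a rough illustration of that precalculation, here's a pure-Python sketch assuming the usual word2vec downsampling formula. The function name `sample_ints` and the tiny example counts are made up for this example; the exact handling of thresholds and retained counts is in the linked gensim source:

```python
from math import sqrt

def sample_ints(word_counts, sample=1e-3):
    """Precompute a keep-threshold in 0..(2**32 - 1) per word from raw corpus counts.

    Assumes the standard word2vec downsampling formula; `sample` mirrors the
    Word2Vec(sample=...) parameter.
    """
    total = sum(word_counts.values())
    threshold_count = sample * total
    result = {}
    for word, count in word_counts.items():
        # 0.0-to-1.0 probability of keeping each occurrence of this word.
        keep_prob = (sqrt(count / threshold_count) + 1) * (threshold_count / count)
        # Words below the frequency threshold cap out at 1.0, i.e. the max
        # 32-bit value described above, so they are never skipped.
        result[word] = int(round(min(keep_prob, 1.0) * (2**32 - 1)))
    return result

counts = {'the': 500_000, 'aardvark': 3}
print(sample_ints(counts))   # 'the' gets a small threshold, 'aardvark' the max
```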