subsampling formula skipgram NLP

Question

I'm studying how to implement a Skip-Gram model using PyTorch and am following this tutorial. In the subsampling part, the author uses this formula:

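import random
import math

# freq (per-index word counts), word_to_ix (word -> index), sum_freq (total
# word count) and words (the corpus tokens) are assumed to be defined earlier.
def subsample_prob(word, t=1e-3):
    # z: fraction of the corpus made up of this word
    z = freq[word_to_ix[word]] / sum_freq
    return (math.sqrt(z/t) + 1) * t/z

words_subsample = [w for w in words if random.random() < subsample_prob(w)]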

where z is the ratio of a given word's count to the total number of words in the corpus. My doubt is that, depending on a word's proportion, this formula gives a result greater than one, so the word is always added to the subsampled corpus. Shouldn't it return a value between zero and one?

score 0 · Accepted Answer · answered Jun 24 '22 at 17:52

The frequent-word downsampling option ('subsampling') introduced in the original word2vec (as a -sample argument) indeed applies downsampling only to a small subset of the very-most-frequent words. (And, given the 'tall head'/Zipfian distributions of words in natural-language texts, that's plenty.)

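Typical values leave most words fully sampled, which this formula reflects by returning a sampling probability greater than 1.0: since random.random() always returns a value below 1.0, any word with such a value is always kept.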

So: there's no error here. It's how the original word2vec implementation, and others, interpret the sample parameter. Most words are exempt from any thinning, but some of the most-common words are heavily dropped. (But, there's still plenty of their varied usage examples in the training set – and indeed spending fewer training updates redundantly on those words lets other words get better vectors, facing less contention/dilution from overtraining of common words.)
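To see where the thinning actually starts, here is a minimal sketch that evaluates the question's formula directly on a word's corpus fraction z. The names keep_prob and clamped_keep_prob are illustrative and not from the tutorial, and the clamp is optional: it only makes explicit what the random.random() comparison already does.

```python
import math

def keep_prob(z, t=1e-3):
    """Raw keep value from the question's formula; z is the word's corpus fraction."""
    return (math.sqrt(z / t) + 1) * t / z

def clamped_keep_prob(z, t=1e-3):
    """Same value capped at 1.0; random.random() < p behaves identically either way."""
    return min(1.0, keep_prob(z, t))

# Setting the raw value to 1 gives z/t = ((1 + sqrt(5)) / 2) ** 2, about 2.618.
threshold = ((1 + math.sqrt(5)) / 2) ** 2 * 1e-3

for z in (1e-5, 1e-3, threshold, 1e-2, 5e-2):
    print(f"z={z:.6f}  raw={keep_prob(z):.4f}  clamped={clamped_keep_prob(z):.4f}")
```

With t=1e-3, only words that make up more than roughly 0.26% of the corpus get a value below 1 and are ever dropped. A word at 1% of the corpus, for example, is kept about 42% of the time.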
