I'm studying how to implement a Skip-Gram model using PyTorch. I'm following this tutorial, and in the subsampling part the author uses this formula:
import math
import random

def subsample_prob(word, t=1e-3):
    # freq, word_to_ix and sum_freq are defined earlier in the tutorial
    z = freq[word_to_ix[word]] / sum_freq
    return (math.sqrt(z / t) + 1) * t / z

words_subsample = [w for w in words if random.random() < subsample_prob(w)]
where z is the count of a given word divided by the total number of words in the corpus. My doubt is that, depending on this proportion, the formula returns a value greater than one, in which case the word is always kept in the subsampled corpus. Shouldn't it return a value between zero and one?
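To make the doubt concrete, here is a minimal sketch that evaluates the same formula directly on a proportion z instead of a word. The names keep_prob and clamped_keep_prob are mine, not the tutorial's, and the sample z values are made up for illustration. Solving (sqrt(z/t) + 1) * t/z = 1 gives z/t = (3 + sqrt(5))/2, roughly 2.618, so with t = 1e-3 any word whose proportion is below about 0.0026 gets a value above one.

```python
import math

def keep_prob(z, t=1e-3):
    """Raw value of the formula from the question for a word with proportion z."""
    return (math.sqrt(z / t) + 1) * t / z

def clamped_keep_prob(z, t=1e-3):
    """Same value capped at 1.0 so it can be read as a probability (my assumption)."""
    return min(1.0, keep_prob(z, t))

# Proportion at which the raw formula equals exactly 1 for t = 1e-3.
threshold = (3 + math.sqrt(5)) / 2 * 1e-3

# Illustrative proportions only, not taken from any real corpus.
for z in (1e-4, 1e-3, threshold, 1e-2, 1e-1):
    print(f"z={z:.6f}  raw={keep_prob(z):.4f}  clamped={clamped_keep_prob(z):.4f}")
```

Since random.random() returns values in [0.0, 1.0), a raw value of one or more makes the comparison always true, so capping it with min does not change which words pass the filter.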