The sampling_table parameter is used only once in the tf.keras.preprocessing.sequence.skipgrams method: to test whether the target word's probability in the sampling_table is smaller than a random number drawn from 0 to 1 (random.random()).
If you have a large vocabulary and a sentence that uses a lot of infrequent words, doesn't this cause the method to skip many of those infrequent words when creating skipgrams? Given that the values of a sampling_table are log-linear, like a Zipf distribution, doesn't this mean you can end up with no skipgrams at all?
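To put a number on that worry, here is a minimal sketch. It assumes each word in the sentence is tested independently and is kept with probability equal to its sampling_table entry. The helper name prob_no_targets_kept and the example probabilities are hypothetical and are not values produced by the library.

```python
import math

def prob_no_targets_kept(keep_probs):
    """Chance that every word fails the check, assuming independent draws."""
    # A word with keep probability p is skipped with probability 1 - p,
    # so all words are skipped with the product of those terms.
    return math.prod(1.0 - p for p in keep_probs)

# Hypothetical keep probabilities for a 5-word sentence (made up for illustration).
mostly_skipped = [0.05] * 5
mostly_kept = [0.9] * 5

print("all skipped, low keep probs: ", prob_no_targets_kept(mostly_skipped))
print("all skipped, high keep probs:", prob_no_targets_kept(mostly_kept))
```

Plugging in the real sampling_table entries for the word indices of a sentence would show how likely the "no skipgrams at all" outcome is for that sentence.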
I'm very confused by this. I am trying to replicate the Word2Vec tutorial and don't understand why or how the sampling_table is being used.
In the source code, these are the lines in question:
if sampling_table[wi] < random.random():
    continue
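To see what this check does in isolation, here is a small simulation that applies the same comparison to every word of a sentence and counts how many survive as target words. It is only a sketch: count_kept_targets, the table values, and the sentence are all hypothetical, and treating index 0 as a non-word is an assumption borrowed from the usual Keras padding convention.

```python
import random

def count_kept_targets(sequence, sampling_table, rng):
    """Count the words that pass the sampling check and so would act as targets."""
    kept = 0
    for wi in sequence:
        if not wi:
            # Assumption: index 0 is padding / a non-word and is never a target.
            continue
        if sampling_table[wi] < rng.random():
            # Same comparison as the lines quoted above: skip this word.
            continue
        kept += 1
    return kept

# Hypothetical table: indices 1-2 are rarely kept, indices 3-4 are almost always kept.
table = [0.0, 0.01, 0.01, 0.95, 0.95]
sentence = [1, 2, 1, 2, 3, 4]

rng = random.Random(0)
trials = 10_000
average = sum(count_kept_targets(sentence, table, rng) for _ in range(trials)) / trials
print(f"average targets kept per pass: {average:.2f} of {len(sentence)}")
```

Swapping in a table built by the library and a real tokenized sentence would show which words are actually dropped most often.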