
The sampling_table parameter is used only once in the tf.keras.preprocessing.sequence.skipgrams method, to test whether the probability of the target word in the sampling_table is smaller than a random number drawn from 0 to 1 (random.random()).

If you have a large vocabulary and a sentence that uses a lot of infrequent words, doesn't this cause the method to skip a lot of those infrequent words when creating skip-grams? Given that the values of a sampling_table follow a log-linear, Zipf-like distribution, doesn't this mean you can end up with no skip-grams at all?

I'm very confused by this. I am trying to replicate the Word2Vec tutorial and don't understand how the sampling_table is being used.

In the source code, these are the lines in question:

            if sampling_table[wi] < random.random():
                continue
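
To make the concern concrete, here is a minimal sketch of what I mean (the vocabulary size, the sequence of word ranks, and the window size are made up for illustration):

    import tensorflow as tf

    # Toy sequence of word ranks from a made-up 1000-word vocabulary.
    vocab_size = 1000
    sequence = [5, 120, 980, 417, 33, 999, 2, 856]

    # Without a sampling_table, every (target, context) pair in the window is kept.
    pairs_all, _ = tf.keras.preprocessing.sequence.skipgrams(
        sequence, vocabulary_size=vocab_size, window_size=2, negative_samples=0.0)

    # With a sampling_table, a target word wi only contributes pairs when
    # sampling_table[wi] >= random.random(), so many pairs can be dropped.
    table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)
    pairs_kept, _ = tf.keras.preprocessing.sequence.skipgrams(
        sequence, vocabulary_size=vocab_size, window_size=2,
        negative_samples=0.0, sampling_table=table)

    # The second count is typically far smaller, often zero for such a short toy sequence.
    print(len(pairs_all), len(pairs_kept))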

1 Answer


This looks like the frequent-word-downsampling feature common in word2vec implementations. (In the original Google word2vec.c code release, and the Python Gensim library, it's adjusted by the sample parameter.)

In practice, it's likely sampling_table has been precalculated so that the rarest words are always used, common words skipped a little, and the very-most-common words skipped a lot.

That seems to be the intent reflected by the comment for make_sampling_table().

You could go ahead and call that with a probe value, like say 1000 for a 1000-word vocabulary, and see what sampling_table it gives back. I suspect it'll be small numbers early (so occurrences of the most-common words are often dropped) and larger numbers late (so most/all rare words are kept).
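
For instance, a quick probe might look like this (a sketch; the exact numbers depend on the Keras version and the default sampling_factor, so the commented values are only approximate):

    import tensorflow as tf

    # Probe table for a hypothetical 1000-word vocabulary; index = word rank,
    # with rank 0 being the most-common word.
    table = tf.keras.preprocessing.sequence.make_sampling_table(1000)

    print(table[0])    # most-common word: small keep-probability (roughly 0.003 with defaults)
    print(table[999])  # rarest word in the table: larger keep-probability (roughly 0.27)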

This tends to improve word-vector quality, by reserving more relative attention for medium- and low-frequency words, and not excessively overtraining/overweighting plentiful words.

gojomo
  • Thank you! You are correct. I had thought the sampling table decreased probability as position went higher, but it's the opposite. ```test = keras.preprocessing.sequence.make_sampling_table(1000) print(test[999]) #0.27343684719607847 print(test[0]) #0.27343684719607847``` – user12346170 Apr 26 '21 at 14:56
  • That value for `test[999]` looks wrong – I'd expect it to be higher, cut-and-paste slip-up? – but yes, that's the right idea. Early (high-rank) words will have lower probability, especially the very few words in the 'tall head' of the distribution... while later words (the 'long tail') will be fully-sampled (`1.0`). – gojomo Apr 26 '21 at 16:49
  • Tuning the `sample` parameter to a smaller number (like `1e-06` or `1e-07`) will be more aggressive about dropping the most-frequent words. In true natural-language word distributions, & with enough data, this dropping-of-more-words can counter-intuitively improve *both* training time & vector quality - because those most-frequent words are so overrepresented, & often semantically weak (like stop words), that deemphasizing them lets other words train better. In contrast, if your data isn't a real natural-language-like distribution - other kinds of log or category data - sometimes *any* downsampling hurts. – gojomo Apr 26 '21 at 16:55
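
For anyone translating that advice to the Keras helper: the rough analogue of gensim's `sample` is the `sampling_factor` argument of `make_sampling_table()` (which assumes a Zipf-like rank-frequency distribution rather than using actual counts). A minimal sketch, with an illustrative vocabulary size and factor:

    import tensorflow as tf

    vocab_size = 10000  # illustrative

    default_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)  # sampling_factor=1e-5
    aggressive_table = tf.keras.preprocessing.sequence.make_sampling_table(
        vocab_size, sampling_factor=1e-7)

    # A smaller sampling_factor gives the most-common words an even smaller
    # keep-probability, so more of their occurrences are skipped when generating skip-grams.
    print(default_table[0], aggressive_table[0])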