
If the vocabulary is ordered from the most frequent word to the least frequent, placing '[UNK]' at the beginning implies that it occurs the most. But what if '[UNK]' isn't the most frequent word? Should I put it somewhere else in the vocabulary, according to its frequency?

I ran into this issue while working through this tutorial: https://www.tensorflow.org/tutorials/text/word2vec

When I do negative sampling with the function tf.random.log_uniform_candidate_sampler, the negative samples with low token ids (e.g. 0, 1, 2, ...) are sampled most often. If '[UNK]' is first (or second when using padding) in the vocabulary, which means it has token id 0 (or 1 when using padding), then '[UNK]' will be heavily sampled as a negative sample. If '[UNK]' actually occurs a lot, there is no problem, but what if it doesn't? Shouldn't it then receive a higher token id?
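For illustration, here is a minimal sketch (with a made-up vocabulary size and positive id, not code from the tutorial) of how the sampler favours low token ids:

import tensorflow as tf

# Hypothetical numbers: a vocabulary of 4096 ids and one positive context id.
# log_uniform_candidate_sampler assumes ids are ordered by frequency
# (a Zipf distribution), so ids 0, 1, 2, ... are drawn far more often.
negative_ids, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=tf.constant([[42]], dtype=tf.int64),  # a positive context id
    num_true=1,
    num_sampled=5,
    unique=True,
    range_max=4096,  # vocabulary size
    seed=42,
)
print(negative_ids)  # low ids such as 0 ('[PAD]') or 1 ('[UNK]') show up often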

Ricardo
  • In the word2vec libraries with which I'm most familiar – Google's original `word2vec.c` code, & the Python Gensim library – no such synthetic `'[UNK]'` token is created, and it's more typical to ignore words outside some established vocabulary of not-very-rare words. So it'd help if you were more specific about where you're seeing such behavior – which code/project? (Any answer as to why it's there or how it should be best handled is likely specific to there.) – gojomo Jul 19 '21 at 14:36
  • Thanks @gojomo. I've edited the question. Hope that it is more specific now. – Ricardo Jul 20 '21 at 16:37

1 Answer


The method that TextVectorization.get_vocabulary() calls always puts the padding and "OOV" tokens as the first elements of the vocabulary, which, as you've mentioned, would imply that they're the most common.

I'm not sure why it was written that way, since the OOV token may not always be the most frequent, as you've noted, but that's how it was implemented:

Source: https://github.com/keras-team/keras/blob/v2.13.1/keras/layers/preprocessing/index_lookup.py#L370
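For instance, adapting a TextVectorization layer on a small made-up corpus (purely illustrative, not from the tutorial) always yields '' and '[UNK]' first, regardless of the actual frequencies:

import tensorflow as tf

# Toy corpus, purely for illustration.
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(["the dog barks", "the cat sleeps", "the dog sleeps"])

# Padding ('') and OOV ('[UNK]') come first, then tokens by descending count
# (tie order may vary), e.g. ['', '[UNK]', 'the', 'dog', 'sleeps', 'barks', 'cat'].
print(vectorize_layer.get_vocabulary())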

However, to ensure that it (or any other frequent stop-word) is not oversampled, which you mentioned you were concerned about, the tutorial does show how to use the tf.keras.preprocessing.sequence.make_sampling_table function to downweight how often items earlier in the vocabulary get sampled.
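Roughly along the lines of the tutorial (the vocabulary size and example sequence below are made up):

import tensorflow as tf

vocab_size = 4096  # assumed vocabulary size; use your own

# make_sampling_table assumes a Zipf-like, rank-ordered vocabulary and returns
# a keep-probability per index, so low-index (frequent) tokens are downweighted.
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

# One tokenized sentence (hypothetical token ids from the vectorize layer).
sequence = [1, 7, 42, 42, 3, 120]
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    sequence,
    vocabulary_size=vocab_size,
    sampling_table=sampling_table,
    window_size=2,
    negative_samples=0,
)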

If you simply don't want the OOV token in the vocabulary, you can always exclude it as well:

inverse_vocab = vectorize_layer.get_vocabulary(include_special_tokens=False)

It also seems like you could manually move the "[UNK]" entry to its frequency-appropriate index, as you suggested, if you want the ordering to be as accurate as possible.
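A rough sketch of that idea, assuming you have computed token counts yourself (the layer does not expose per-token counts, so token_counts and unk_count below are hypothetical):

# Hypothetical counts gathered from your own corpus statistics.
token_counts = {"the": 120, "of": 90, "dog": 15, "barks": 7}
unk_count = 20  # hypothetical number of out-of-vocabulary occurrences

vocab = sorted(token_counts, key=token_counts.get, reverse=True)

# Insert '[UNK]' at the rank matching its frequency.
insert_at = next(
    (i for i, tok in enumerate(vocab) if token_counts[tok] < unk_count),
    len(vocab),
)
vocab.insert(insert_at, "[UNK]")
print(vocab)  # ['the', 'of', '[UNK]', 'dog', 'barks']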

mverkruyse
  • Ok, but that just explains how to include or exclude [UNK] in your vocabulary. If you decide to include it (i.e., include_special_tokens=True), then it stays in the first position EVEN WHEN IT ISN'T THE MOST FREQUENT WORD. – Ricardo Aug 24 '23 at 16:03
  • Sorry, I misread your original question; I've updated the answer to reflect that. – mverkruyse Aug 24 '23 at 20:57
  • I think your suggestion is the best way to be precise about putting the OOV token at the right position, but it has to be done manually. I think the vectorize_layer.get_vocabulary() method could be improved to compute this directly. Since it doesn't, your answer can serve as a workaround. – Ricardo Aug 26 '23 at 18:51