
I need to calculate the word vectors for each word of a sentence that is tokenized as follows:

['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']. 

If I were using the pretrained [fastText][1] embeddings (cc.en.300.bin.gz from Facebook), I could handle out-of-vocabulary (OOV) words, because fastText can build vectors from character n-grams. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for words that are OOV? I searched online and could not find anything. Of course, one way to handle this is to remove every sentence that contains a word not listed in Google's word2vec, but I noticed that only 5550 out of 16134 sentences have all their words in the embedding.
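For reference, the coverage count above comes from a simple membership check per sentence. A minimal sketch with a toy vocabulary standing in for the loaded GoogleNews model (with a real gensim model, `w in model` performs the same membership test, since KeyedVectors supports `in`):

```python
# Toy stand-in vocabulary; with a real model, `w in model` works the
# same way because gensim's KeyedVectors supports membership tests.
vocab = {'my', 'aunt', 'me', 'a'}

sentences = [
    ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'],
    ['my', 'aunt'],
    ['me', 'a'],
]

# Keep only sentences whose every token is in the vocabulary.
fully_covered = [s for s in sentences if all(w in vocab for w in s)]
print(len(fully_covered), 'of', len(sentences), 'sentences fully covered')
```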

I also tried:

model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True) 
model.train(sentences_with_OOV_words)

However, TensorFlow 2 returns an error.

Any help would be greatly appreciated.

chikitin
  • If the vocab is not found, you can initialize them with a zero-vector (i.e. vectors of 300 dimensions, all 0). – Toukenize Sep 16 '19 at 04:22
  • Do you mean I should create a child class of gensim.models.keyedvectors.Word2VecKeyedVectors and then override the 'get_vec' method there? If so, where can I find the implementation? Thank you. – chikitin Sep 16 '19 at 04:53
  • I think you can just do a `try` and `except` instead of creating a child class. See my answer. – Toukenize Sep 16 '19 at 05:13

3 Answers


If the vocab is not found, initialize it with a zero vector of the same size (GoogleNews word2vec vectors have 300 dimensions):

import numpy as np

try:
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
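Applied to a whole tokenized sentence, the fallback looks like this. A dependency-free sketch: a plain dict stands in for the loaded model and Python lists stand in for the NumPy vectors (with gensim you would call `model.get_vector(word)` and fall back to `np.zeros((300,))`):

```python
DIM = 4  # GoogleNews vectors have 300 dimensions; 4 keeps the example short

# Toy embedding table standing in for the loaded KeyedVectors.
toy_vectors = {
    'my':   [0.1] * DIM,
    'aunt': [0.2] * DIM,
    'me':   [0.3] * DIM,
}

def lookup(word):
    """Return the word's vector, or an all-zeros fallback if it is OOV."""
    try:
        return toy_vectors[word]   # with gensim: model.get_vector(word)
    except KeyError:
        return [0.0] * DIM         # with gensim: np.zeros((DIM,))

sentence = ['my', 'aunt', 'give', 'me']
matrix = [lookup(w) for w in sentence]  # 'give' gets the zero fallback
```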
Toukenize

Awesome! Thank you very much.

import numpy as np

def get_vectorOOV(s):
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))
chikitin

The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.

(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)

You can check if a word is available using the `in` keyword. As other answers have noted, you can then choose to use some plug value (such as an all-zeros vector) for such words.

But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
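Dropping the unknown words entirely, as suggested above, is a one-line filter. A sketch with a toy vocabulary standing in for the real model (with gensim, `w in model` performs the same membership test):

```python
vocab = {'my', 'aunt', 'me', 'a'}  # toy stand-in for the loaded KeyedVectors

sentence = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']

# Pretend the unknown words aren't even in the text.
known = [w for w in sentence if w in vocab]
```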

gojomo
  • Thank you for the recap. I think in my case the all-zeros vector was the way to go, and I got 100% accuracy on my unseen test set! Now you made me wonder! – chikitin Sep 16 '19 at 20:28