
I need to calculate the word vectors for each word of a sentence that is tokenized as follows:

['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']. 

If I were using the pretrained [fastText][1] embeddings (cc.en.300.bin.gz from Facebook), I could handle out-of-vocabulary (OOV) words, because fastText can build vectors from character n-grams. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for words that are OOV? I searched online and could not find anything. Of course, one way to handle this is to remove every sentence that contains a word not listed in Google's word2vec, but I noticed that only 5550 out of 16134 sentences have all their words in the embedding.
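For reference, the coverage count above comes from a simple membership check per sentence. A minimal sketch with a toy vocabulary standing in for the loaded GoogleNews model (with a real gensim model, `w in model` performs the same membership test, since KeyedVectors supports `in`):

```python
# Toy stand-in vocabulary; with a real model, `w in model` works the
# same way because gensim's KeyedVectors supports membership tests.
vocab = {'my', 'aunt', 'me', 'a'}

sentences = [
    ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'],
    ['my', 'aunt'],
    ['me', 'a'],
]

# Keep only sentences whose every token is in the vocabulary.
fully_covered = [s for s in sentences if all(w in vocab for w in s)]
print(len(fully_covered), 'of', len(sentences), 'sentences fully covered')
```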

I also tried:

model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True) 
model.train(sentences_with_OOV_words)

However, TensorFlow 2 returns an error.

Any help would be greatly appreciated.

chikitin
  • If the vocab is not found, you can initialize them with a zero-vector (i.e. vectors of 300 dimensions, all 0). – Toukenize Sep 16 '19 at 04:22
  • Do you mean I should create a child class of gensim.models.keyedvectors.Word2VecKeyedVectors and then override the 'get_vec' method there? If so, where can I find the implementation? Thank you. – chikitin Sep 16 '19 at 04:53
  • I think you can just do a `try` and `except` instead of creating a child class. See my answer. – Toukenize Sep 16 '19 at 05:13

3 Answers


If the vocab is not found, initialize it with a zero vector of the same size (GoogleNews word2vec vectors have 300 dimensions):

import numpy as np

try:
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
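Applied to a whole tokenized sentence, the fallback looks like this. A dependency-free sketch: a plain dict stands in for the loaded model and Python lists stand in for the NumPy vectors (with gensim you would call `model.get_vector(word)` and fall back to `np.zeros((300,))`):

```python
DIM = 4  # GoogleNews vectors have 300 dimensions; 4 keeps the example short

# Toy embedding table standing in for the loaded KeyedVectors.
toy_vectors = {
    'my':   [0.1] * DIM,
    'aunt': [0.2] * DIM,
    'me':   [0.3] * DIM,
}

def lookup(word):
    """Return the word's vector, or an all-zeros fallback if it is OOV."""
    try:
        return toy_vectors[word]   # with gensim: model.get_vector(word)
    except KeyError:
        return [0.0] * DIM         # with gensim: np.zeros((DIM,))

sentence = ['my', 'aunt', 'give', 'me']
matrix = [lookup(w) for w in sentence]  # 'give' gets the zero fallback
```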
Toukenize

Awesome! Thank you very much.

import numpy as np

def get_vectorOOV(s):
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))
chikitin

The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.

(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)

You can check if a word is available using the `in` keyword. As other answers have noted, you can then choose to use some plug value (such as an all-zeros vector) for such words.

But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
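Dropping the unknown words entirely, as suggested above, is a one-line filter. A sketch with a toy vocabulary standing in for the real model (with gensim, `w in model` performs the same membership test):

```python
vocab = {'my', 'aunt', 'me', 'a'}  # toy stand-in for the loaded KeyedVectors

sentence = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']

# Pretend the unknown words aren't even in the text.
known = [w for w in sentence if w in vocab]
```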

gojomo
  • Thank you for the recap. I think in my case the all-zeros vector was the way to go, and I got 100% accuracy on my unseen test set! Now you made me wonder! – chikitin Sep 16 '19 at 20:28