
I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors through the gensim library. But my corpus contains some words that are not among the GoogleNews words. For these missing words, I want to use the arithmetic mean of the n most similar words in the GoogleNews vocabulary. First I load GoogleNews and check whether the word "to" is in it:

#Load GoogleNews pretrained word2vec model
from gensim.models import KeyedVectors
model=KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
print(model["to"])

I receive an error: KeyError: "word 'to' not in vocabulary". Is it possible that such a large dataset doesn't have this word? This is also true for some other common words like "a"!

To add the missing words to the word2vec model, first I want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.

#obtain index of words
from collections import OrderedDict
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})

Then I calculate the mean of the embedding vectors of the most similar words to each missing word.

import numpy as np

missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=np.mean(similar_embeddings,axis=0)

And then I add these new embeddings to the word2vec model by:

for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd

There is an inconsistency. When I print missing_embd, it is empty, as if there were no missing words. But when I check it with this:

for w in tokens_lower:
    if(w in model.wv.vocab)==False:
        print(w)
        print("***********")

I find a lot of missing words. Now I have 3 questions: 1. Why is missing_embd empty while there are missing words? 2. Is it possible that GoogleNews doesn't have words like "to"? 3. How can I append new embeddings to the word2vec model? I used build_vocab and syn0. Thanks.

Mahsa
  • The GoogleNews word2vec model probably excluded 'to' and 'a' due to their insignificance as stopwords. I don't think it's possible to update the model vocab, since the model was generated from the C tool per the [tutorial](https://rare-technologies.com/word2vec-tutorial/), but you can give it a shot with `model.build_vocab(sentences, update=True)`. – Scratch'N'Purr May 31 '18 at 07:48
  • You mean that after loading the model I use `model.build_vocab(sentences, update=True)`? And then what are the embedding vectors for the missing words? – Mahsa May 31 '18 at 07:53
  • Yes, you can try that, but again, I don't think its possible since Google's word2vec model was built with the C toolkit. You won't be able to get any similar embeddings for missing words since the model vocab never had these words to train on. – Scratch'N'Purr May 31 '18 at 08:12
  • Thanks for your comments. But when I use `model.build_vocab` I get this error: `AttributeError: 'KeyedVectors' object has no attribute 'build_vocab'`. How can I use build_vocab? – Mahsa May 31 '18 at 08:16
  • Hmmm try this instead: `model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)`. If you manage to get `build_vocab` to work afterwards, you would still have to do additional training using `model.train(sentences)` – Scratch'N'Purr May 31 '18 at 08:28
  • But when I don't use `KeyedVectors.load`, I receive this error: `raise DeprecationWarning("Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.") DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.` – Mahsa May 31 '18 at 08:40
  • That's what I was afraid of. I can't really help you then given the API change, but my gut feeling tells me that you wouldn't be able to update the model anyways even if you manage to get `build_vocab` to work. – Scratch'N'Purr May 31 '18 at 09:20

2 Answers


Here is a scenario where we are adding a missing lower case word.

from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)

'Quoran' in embedding.vocab
 Output : True

'quoran' in embedding.vocab
 Output : False

Here 'Quoran' is present, but 'quoran' in lower case is missing.

# add 'quoran' in lower case, copying the vector of 'Quoran'
embedding.add('quoran', embedding.get_vector('Quoran'), replace=False)

'quoran' in embedding.vocab
 Output : True
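
In gensim releases where add() is available (3.7 and later, if I recall correctly), it also accepts parallel lists, so a batch of missing words can be appended in one call. A rough sketch; the words and the random placeholder vectors below are purely illustrative:

import numpy as np

# Hypothetical batch of missing words with 300-dimensional placeholder vectors;
# in practice you would supply whatever vectors you want these words to have.
new_words = ['quoran', 'some_word_not_in_googlenews']
new_vectors = [np.random.uniform(-0.25, 0.25, 300).astype(np.float32) for _ in new_words]
embedding.add(new_words, new_vectors, replace=False)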
Somex Gupta

It's possible Google removed common filler words like 'to' and 'a'. If the file seems otherwise uncorrupted, and checking other words after load() shows that they are present, it's reasonable to assume Google discarded the overly common words as having such diffuse meaning that they are of low value.
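
A quick way to run that check yourself (a minimal sketch, assuming a KeyedVectors object loaded with load_word2vec_format as in the question; the probe words are just illustrations):

# Probe a few words to confirm the file loaded sanely. With the genuine
# GoogleNews file, ordinary content words such as 'king' should be present,
# while very frequent stop-words like 'to' and 'a' are reported missing.
for probe in ['to', 'a', 'king', 'computer']:
    print(probe, probe in model.vocab)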

It's unclear and muddled what you're trying to do. You assign to word_to_idx twice - so only the second line matters.

(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
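
If the goal is a single mapping that keeps every corpus word and marks the missing ones, one pass is enough. A rough sketch, reusing the question's corpus_words and model names; -1 is used as the missing-word marker, because index 0 is a legitimate vocabulary position:

from collections import OrderedDict

# Known words get their real index; missing words get -1 rather than 0,
# since 0 would collide with the word actually stored at position 0.
word_to_idx = OrderedDict(
    (w, model.wv.vocab[w].index if w in model.wv.vocab else -1)
    for w in corpus_words
)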

You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for known words: it will raise an error if tried on a completely unknown word, so that approach can't work.
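
You can see this directly; a minimal illustration, where the query token is just a made-up out-of-vocabulary string:

# most_similar() looks the query word up in the vocabulary first, so an
# unknown word raises KeyError instead of returning neighbours.
try:
    model.most_similar('notarealword123')
except KeyError as err:
    print(err)   # word 'notarealword123' not in vocabulary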

And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.
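
If you do want to go that route, the sketch below shows the kind of self-consistent update that is involved. It assumes gensim 3.x internals as named above (the array is called vectors in newer releases and syn0 in older ones; index2entity is an alias for the index2word list), and attribute names do shift between versions; later gensim releases expose an add() method instead, as the other answer here shows.

import numpy as np
from gensim.models.keyedvectors import Vocab  # record type used by gensim 3.x; location varies by version

def append_vector(kv, word, vector):
    # Manually append one word -> vector entry to a loaded KeyedVectors (gensim 3.x sketch).
    if word in kv.vocab:
        return                                         # already known, nothing to do
    arr = kv.vectors if hasattr(kv, 'vectors') else kv.syn0
    new_row = np.asarray(vector, dtype=arr.dtype).reshape(1, -1)
    kv.vocab[word] = Vocab(index=len(kv.index2word), count=1)   # register the word
    stacked = np.vstack([arr, new_row])                # grow the vector matrix by one row
    if hasattr(kv, 'vectors'):
        kv.vectors = stacked
    else:
        kv.syn0 = stacked
    kv.index2word.append(word)                         # index2entity tracks this list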

gojomo
  • Now I think that I don't need to append new words to GoogleNews, because my goal is to obtain a word embedding vector for each of the words in my corpus. I can use the GoogleNews word embeddings for known words. For missing words I can use a randomly initialized vector with the same length as the GoogleNews embeddings (300). But can you guide me on the best way to generate the initial word embedding vectors? – Mahsa Jun 02 '18 at 09:26
  • If you have a good-sized corpus, train your own embeddings! There's nothing magical about the `GoogleNews` set, and if your problem domain *isn't* the same sort of news-articles it was trained on, your native embeddings might be better. If you have a good training corpus but still need to bootstrap word-vectors for words that are later out-of-vocabulary (OOV), consider using something like Facebook's FastText, an advanced word2vec variant that can approximate vectors for new words by composing subword vectors learned from the original corpus. – gojomo Jun 03 '18 at 00:32
  • Such FastText OOV-generated vectors are quite rough – but better than random, and perhaps quite good for typos and word-form variants, in languages where such word-structure/word-roots provide strong hints as to meaning. Facebook has also released pre-trained FastText vectors (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md), analogous to GoogleNews though only trained on Wikipedia, for many languages. – gojomo Jun 03 '18 at 00:33
  • Do you know the range in which to initialize the vectors? I read that just initializing the vectors randomly is enough to give the word a valid vector. Do you know how to initialize the vectors for a new word in GoogleNews-vectors-negative300? – John Barton Sep 23 '19 at 23:22
  • @JuanPerez, these models typically initialize word-vectors to low-magnitude random positions – but then it is only via training, alongside other words from varied usage examples, that the word-vectors move to positions that are useful. (Just adding random vectors isn't helpful.) You'd have to say more about what you're trying to achieve to give a good answer – and that probably deserves a new question specific to your needs. – gojomo Sep 24 '19 at 00:48
  • Thanks a lot @gojomo, I am tailoring the question to post it. – John Barton Sep 24 '19 at 05:01