
I trained a Word2Vec model using Gensim 3.8.0. Later I tried to load the pretrained model with Gensim 4.0.0 on GCP, using the following code:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(wv_path, binary=False)
words = model.wv.vocab.keys()
self.word2vec = {word: model.wv[word] % EMBEDDING_DIM for word in words}

I got an error saying that "model.wv" was removed in Gensim 4.0.0. Then I used the following code:

model = KeyedVectors.load_word2vec_format(wv_path, binary=False)
words = model.vocab.keys()
word2vec = {word: model[word] % EMBEDDING_DIM for word in words}

Now I get the following error:

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

Can anyone please suggest how I can use the pretrained model and build the word-to-vector dictionary in Gensim 4.0.0?

3 Answers


The changes introduced by the migration from Gensim 3.x to 4 are all documented at the GitHub link:

https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

For the above problem, this is the solution that worked for me (note that your model is already a KeyedVectors instance, so there is no .wv attribute):

    words = list(model.index_to_key)
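
For completeness, here is a minimal sketch of a full replacement for the question's snippet, assuming wv_path is defined as in the question (I've dropped the % EMBEDDING_DIM step; see the next answer for why it serves no purpose):

    from gensim.models import KeyedVectors

    # Load vectors saved in word2vec text format (Gensim 4.x API).
    model = KeyedVectors.load_word2vec_format(wv_path, binary=False)

    # index_to_key replaces the old vocab.keys(); per-word lookup is unchanged.
    words = list(model.index_to_key)
    word2vec = {word: model[word] for word in words}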
— Debangan Mandal

The migration notes explain major changes & how to adapt your code:

https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

Per the guidance there, to just get a list of the words, since your model variable is already an instance of KeyedVectors, you can use:

model.index_to_key

Your code doesn't show a need for a dict, but there is a slightly-different word-to-index-position dict in model.key_to_index. However, you can just use model[key] like before to get individual vectors.
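
As a small illustration of those attributes (a sketch, assuming model is the KeyedVectors instance loaded in the question, with a non-empty vocabulary):

words = model.index_to_key          # list of all words, in index order
positions = model.key_to_index      # dict mapping each word to its row index

first_word = words[0]
assert positions[first_word] == 0   # the two structures are inverses
vec = model[first_word]             # per-word vector lookup, as in Gensim 3.x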

(Separately: I can't imagine your %EMBEDDING_DIM is doing anything useful. Why would you want to perform an elementwise % modulus operation, using the integer count of dimensions, against individual dimensions that are often small floating-point numbers? For small positive values it's a no-op, since EMBEDDING_DIM will usually be far larger than the individual values, but because Python's % follows the sign of the divisor it silently shifts every negative component up by EMBEDDING_DIM, and in no case does it serve any good purpose.)
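
A quick demonstration of that point, using toy component values and a hypothetical EMBEDDING_DIM of 100:

import numpy as np

EMBEDDING_DIM = 100                    # hypothetical dimensionality
vec = np.array([0.12, -0.53, 0.07])    # typical small word-vector components

# Positive components pass through unchanged, but the negative one is shifted
# up by the divisor: Python/NumPy modulus takes the sign of the divisor.
print(vec % EMBEDDING_DIM)             # [ 0.12 99.47  0.07]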

— gojomo

On Gensim 4.0.0 you need to use the key_to_index attribute of the KeyedVectors object; calling .keys() on it returns a dict_keys view of all the words (keys) in the model, so you can still iterate through your whole vocabulary :).

Your code should now look like this (remember your model is already a KeyedVectors instance, so drop the .wv):

model = KeyedVectors.load_word2vec_format(wv_path, binary=False)
words = list(model.key_to_index.keys())
self.word2vec = {word: model[word] % EMBEDDING_DIM for word in words}
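
And a quick sanity check on the rebuilt mapping (a sketch, keeping the question's self. prefix and assuming the vocabulary is non-empty):

some_word = words[0]
print(len(self.word2vec), "words loaded")
print(some_word, self.word2vec[some_word][:5])   # first 5 components of its vector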
— Liliana