
I trained a Word2Vec model using Gensim 3.8.0. Later I tried to load the pretrained model with Gensim 4.0.0 on GCP, using the following code:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(wv_path, binary=False)
words = model.wv.vocab.keys()
self.word2vec = {word: model.wv[word] % EMBEDDING_DIM for word in words}

I got an error saying that "model.wv" was removed in Gensim 4.0.0. Then I used the following code:

model = KeyedVectors.load_word2vec_format(wv_path, binary=False)
words = model.vocab.keys()
word2vec = {word: model[word] % EMBEDDING_DIM for word in words}

Now I get the following error:

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

Can anyone please suggest how I can use the pretrained model and build the word-to-vector dictionary in Gensim 4.0.0?

3 Answers


The changes introduced by the migration from Gensim 3.x to 4 are all documented at the GitHub link:

https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

For the above problem, this is the solution that worked for me (note that your model is already a KeyedVectors instance, so there is no .wv attribute):

    words = list(model.index_to_key)
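
For completeness, here is a minimal sketch of a full replacement for the question's snippet, assuming wv_path is defined as in the question (I've dropped the % EMBEDDING_DIM step; see the next answer for why it serves no purpose):

    from gensim.models import KeyedVectors

    # Load vectors saved in word2vec text format (Gensim 4.x API).
    model = KeyedVectors.load_word2vec_format(wv_path, binary=False)

    # index_to_key replaces the old vocab.keys(); per-word lookup is unchanged.
    words = list(model.index_to_key)
    word2vec = {word: model[word] for word in words}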
— Debangan Mandal

The migration notes explain major changes & how to adapt your code:

https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

Per the guidance there, to just get a list of the words, since your model variable is already an instance of KeyedVectors, you can use:

model.index_to_key

Your code doesn't show a need for a dict, but there is a slightly-different word-to-index-position dict in model.key_to_index. However, you can just use model[key] like before to get individual vectors.
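
As a small illustration of those attributes (a sketch, assuming model is the KeyedVectors instance loaded in the question, with a non-empty vocabulary):

words = model.index_to_key          # list of all words, in index order
positions = model.key_to_index      # dict mapping each word to its row index

first_word = words[0]
assert positions[first_word] == 0   # the two structures are inverses
vec = model[first_word]             # per-word vector lookup, as in Gensim 3.x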

(Separately: I can't imagine your %EMBEDDING_DIM is doing anything useful. Why would you want to perform an elementwise % modulus operation, using the integer count of dimensions, against individual dimensions that are often small floating-point numbers? For small positive values it's a no-op, since EMBEDDING_DIM will usually be far larger than the individual values, but because Python's % follows the sign of the divisor it silently shifts every negative component up by EMBEDDING_DIM, and in no case does it serve any good purpose.)
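
A quick demonstration of that point, using toy component values and a hypothetical EMBEDDING_DIM of 100:

import numpy as np

EMBEDDING_DIM = 100                    # hypothetical dimensionality
vec = np.array([0.12, -0.53, 0.07])    # typical small word-vector components

# Positive components pass through unchanged, but the negative one is shifted
# up by the divisor: Python/NumPy modulus takes the sign of the divisor.
print(vec % EMBEDDING_DIM)             # [ 0.12 99.47  0.07]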

— gojomo

On Gensim 4.0.0 you need to use the key_to_index attribute of the KeyedVectors object; calling .keys() on it returns a dict_keys view of all the words (keys) in the model, so you can still iterate through your whole vocabulary :).

Your code should now look like this (remember your model is already a KeyedVectors instance, so drop the .wv):

model = KeyedVectors.load_word2vec_format(wv_path, binary=False)
words = list(model.key_to_index.keys())
self.word2vec = {word: model[word] % EMBEDDING_DIM for word in words}
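
And a quick sanity check on the rebuilt mapping (a sketch, keeping the question's self. prefix and assuming the vocabulary is non-empty):

some_word = words[0]
print(len(self.word2vec), "words loaded")
print(some_word, self.word2vec[some_word][:5])   # first 5 components of its vector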
— Liliana