Gensim (word2vec) retrieve n most frequent words

Question

How can I retrieve the n most frequent words from a Gensim word2vec model? As I understand it, frequency and count are not the same, so I can't use the object.count() method.

I need to produce a list of the n most frequent words from my word2vec model.

Edit:

I've tried the following:

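# Map each word to the count recorded in the vocabulary (Gensim < 4.0 API).
w2c = {word: entry.count for word, entry in model.wv.vocab.items()}
# Sort by count, highest first; dicts keep insertion order in Python 3.7+.
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
w2cSortedList = list(w2cSorted.keys())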

My initial guess was to use the code above, but it relies on the count attribute, and I'm not sure whether that actually gives the most frequent words.

What have you tried? Demonstrate some effort in solving the problem and provide [MCVE](https://stackoverflow.com/help/mcve). — sophros, Dec 04 '18 at 22:17

gojomo · Accepted Answer · 2021-07-09T15:08:18.707

The .count property of each vocab entry is the number of times that word was seen during the initial vocabulary survey. Sorting by it and taking the highest-count words will therefore give you the most frequent words.
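To illustrate why count and frequency rank words the same way here, the following is a minimal sketch in plain Python rather than Gensim's own code. The tiny corpus and the alphabetical tie-breaking are assumptions made only for this illustration.

```python
from collections import Counter

# Hypothetical tokenized corpus, used only for illustration.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

# Count every token occurrence, much as a vocabulary survey would.
counts = Counter(token for sentence in sentences for token in sentence)

# Rank words by descending count; ties are broken alphabetically
# (an arbitrary choice for this sketch) so the order is deterministic.
ranked = sorted(counts, key=lambda w: (-counts[w], w))

top_2 = ranked[:2]
print(top_2)
```

Because every word is counted over the same corpus, ranking by raw count and ranking by relative frequency produce the same order.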

Also, for efficiency, it is typical practice for the list of known words to be ordered from most to least frequent. That list is available as model.wv.index_to_key, so you can retrieve the 100 most frequent words with model.wv.index_to_key[:100]. (In Gensim before version 4.0, the same list was called either index2entity or index2word.)
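The slicing approach can be wrapped in a small helper that works across both attribute names. This is a sketch only: top_n_words and fake_wv are hypothetical names introduced here, and the helper assumes the key list really is ordered from most to least frequent.

```python
from types import SimpleNamespace


def top_n_words(wv, n):
    """Return the first n keys of a KeyedVectors-like object.

    Assumes the key list is ordered from most to least frequent.
    """
    # Gensim 4.0+ exposes the ordered key list as index_to_key.
    keys = getattr(wv, "index_to_key", None)
    if keys is None:
        # Older Gensim releases called the same list index2word.
        keys = wv.index2word
    return list(keys[:n])


# Hypothetical stand-in for model.wv so the sketch runs without a trained model.
fake_wv = SimpleNamespace(index_to_key=["the", "of", "and", "to"])
print(top_n_words(fake_wv, 2))
```

With a trained model, the call would be top_n_words(model.wv, 100).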
