
How is it possible to retrieve the n most frequent words from a Gensim word2vec model? As I understand, the frequency and count are not the same, and I therefore can't use the object.count() method.

I need to produce a list of the n most frequent words from my word2vec model.

Edit:

I've tried the following:

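# Map each word to its count (model.wv.vocab is the Gensim < 4.0 API)
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
# Sort by count, highest first, and keep only the words
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
w2cSortedList = list(w2cSorted.keys())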

My initial guess was to use the code above, but it relies on the count attribute. I'm not sure whether this gives the most frequent words.

Harshal Parekh
Phils19
  • What have you tried? Demonstrate some effort in solving the problem and provide [MCVE](https://stackoverflow.com/help/mcve). – sophros Dec 04 '18 at 22:17

1 Answer


The .count property of each vocab entry is the count of that word as seen during the initial vocabulary survey. So sorting by that, and taking the highest-count words, will give you the most frequent words.
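As a minimal sketch of the sort-by-count approach, the helper below picks the n highest-count words. The name top_n_by_count is hypothetical, and the code assumes the Gensim 4.x accessors `key_to_index` and `get_vecattr(word, "count")`; in Gensim before 4.0 the count would instead come from `model.wv.vocab[word].count`, as in the question.

```python
import heapq
from typing import List, Tuple


def top_n_by_count(wv, n: int) -> List[Tuple[str, int]]:
    """Return the n highest-count (word, count) pairs, most frequent first.

    Assumption: `wv` behaves like a Gensim 4.x KeyedVectors object, i.e.
    `wv.key_to_index` has every known word as a key and
    `wv.get_vecattr(word, "count")` returns the count recorded during the
    vocabulary survey.
    """
    # Lazily pair each known word with its stored count.
    counts = ((word, wv.get_vecattr(word, "count")) for word in wv.key_to_index)
    # nlargest avoids sorting the whole vocabulary when n is small.
    return heapq.nlargest(n, counts, key=lambda pair: pair[1])
```

With a trained model, this would likely be called as `top_n_by_count(model.wv, 100)`.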

Also, for efficiency, the list of known words is typically kept ordered from most to least frequent. You can see this list at model.wv.index_to_key, so you can retrieve the 100 most frequent words with model.wv.index_to_key[:100]. (In Gensim before version 4.0, this same list was called either index2entity or index2word.)
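The slicing approach can be sketched in a way that tolerates both naming schemes. The helper name most_frequent_words is hypothetical, and the result is only frequency-ordered under the assumption that the vocabulary was sorted when it was built (which, as far as I know, is the Word2Vec default via `sorted_vocab=1`).

```python
from typing import List


def most_frequent_words(wv, n: int) -> List[str]:
    """Return the first n entries of the model's ordered word list.

    Assumption: the list is ordered from most to least frequent, which
    holds only if the vocabulary was sorted by count when it was built.
    """
    # Gensim 4.x uses index_to_key; older versions used index2word or
    # index2entity for the same list.
    for attr in ("index_to_key", "index2word", "index2entity"):
        words = getattr(wv, attr, None)
        if words is not None:
            return list(words[:n])
    raise AttributeError("no ordered word list found on this object")
```

For a trained model, `most_frequent_words(model.wv, 100)` should match `model.wv.index_to_key[:100]` on Gensim 4.x.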

gojomo