Loading the complete pre-trained word2vec model by Google is slow and tedious, so I was wondering whether it is possible to remove words below a certain frequency to bring the vocabulary down to, e.g., 200k words.

I found Word2Vec methods in the gensim package to determine the word frequency and to re-save the model, but I am not sure how to pop/remove vocab from the pre-trained model before saving it again. I couldn't find any hint of such an operation in the KeyedVectors class or the Word2Vec class.

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of the pre-trained word2vec model?

neurix
  • I am probably late, but take a look at [this repo](https://github.com/eyaler/word2vec-slim). I think it is precisely what you are looking for. – edu_ Mar 15 '18 at 15:58

2 Answers

The GoogleNews word-vectors file format doesn't include frequency info, but the file does appear to be sorted in roughly most-frequent to least-frequent order.

In addition, load_word2vec_format() offers an optional limit parameter that reads only that many vectors from the start of the given file.

So, the following should do roughly what you've requested:

from gensim.models import KeyedVectors
goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
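
If the goal is a smaller file on disk rather than a faster load each time, the same "keep the first N entries" idea can be applied to the file itself. The following is a minimal sketch using only the standard library, not gensim's own code. It assumes the usual word2vec binary layout (a text header `vocab_size dim`, then for each entry the word, a space, and `dim` float32 values, optionally followed by a newline). The function name `trim_word2vec_binary` and the demo data are hypothetical.

```python
import io
import struct


def trim_word2vec_binary(src, dst, limit):
    """Copy the first `limit` vectors from one word2vec binary stream to another.

    Hypothetical helper, not part of gensim. Returns the number of vectors kept.
    """
    vocab_size, dim = (int(x) for x in src.readline().split())
    keep = min(limit, vocab_size)
    dst.write(f"{keep} {dim}\n".encode("utf8"))
    vector_bytes = 4 * dim  # assumption: components are stored as float32
    for _ in range(keep):
        word = bytearray()
        while True:
            ch = src.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # some writers put a newline after each vector
                word += ch
        vector = src.read(vector_bytes)
        if len(vector) != vector_bytes:
            raise EOFError("file ended in the middle of a vector")
        dst.write(bytes(word) + b" " + vector)
    return keep


# Tiny in-memory demo: three 2-dimensional vectors, keep the first two.
def _record(word, values):
    packed = struct.pack(f"<{len(values)}f", *values)
    return word.encode("utf8") + b" " + packed + b"\n"


demo_src = io.BytesIO(
    b"3 2\n"
    + _record("the", [1.0, 2.0])
    + _record("of", [3.0, 4.0])
    + _record("rare", [5.0, 6.0])
)
demo_dst = io.BytesIO()
kept = trim_word2vec_binary(demo_src, demo_dst, limit=2)
```

For the real file, `src` would be something like `gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb')` and `dst` a file opened with mode `'wb'`. Because the file is only roughly frequency-sorted, this keeps approximately, not exactly, the most frequent words.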
gojomo
  • In GloVe, the vocabulary is sorted by frequency (see https://github.com/stanfordnlp/GloVe/blob/master/src/vocab_count.c#L206). You can convert GloVe to word2vec format and then use the suggestion above. – ben26941 Jul 04 '18 at 15:34
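
The conversion mentioned in this comment can be sketched as follows, assuming the plain-text GloVe layout (one `word v1 v2 ...` line per entry, no header) and the word2vec text layout (the same lines preceded by a `count dim` header). The helper name `glove_to_word2vec_text` is hypothetical, and the source stream is assumed to be seekable and positioned at the start.

```python
import io
from itertools import islice


def glove_to_word2vec_text(src, dst, limit=None):
    """Write the first `limit` GloVe text lines to `dst` with a word2vec header.

    Hypothetical helper. Returns (count, dim) as written to the header.
    """
    first = src.readline()
    if not first.strip():
        raise ValueError("source appears to be empty")
    dim = len(first.rstrip().split(" ")) - 1  # everything after the word

    # First pass: count how many lines will actually be kept.
    src.seek(0)
    count = sum(1 for _ in islice(src, limit))

    # Second pass: write the header, then copy the kept lines unchanged.
    src.seek(0)
    dst.write(f"{count} {dim}\n")
    for line in islice(src, limit):
        dst.write(line if line.endswith("\n") else line + "\n")
    return count, dim


# Tiny in-memory demo: three 2-dimensional vectors, keep the first two.
glove_src = io.StringIO("the 0.1 0.2\nof 0.3 0.4\nrare 0.5 0.6\n")
w2v_dst = io.StringIO()
count, dim = glove_to_word2vec_text(glove_src, w2v_dst, limit=2)
```

The resulting text file should then be loadable with load_word2vec_format() using binary=False. Counting in a separate pass avoids holding all the lines in memory, at the cost of reading the kept portion twice.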

Do you know about this open list of pre-trained models? A smaller alternative might serve you better than the jumbo Google one. :)

https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models

I don't know how to do precisely what you need, but there is a discussion on the gensim Google group about trimming models that might be of use: https://groups.google.com/forum/#!topic/gensim/wkVhcuyj0Sg

They also reference a recent change that reduces the model's size, though I know that is not exactly what you want.

https://github.com/RaRe-Technologies/gensim/pull/987

Luke Barker