Working with google word2vec .bin files in gensim python

Question

I’m trying to get started by loading the pretrained .bin files from the Google word2vec site (freebase-vectors-skipgram1000.bin.gz) into the gensim implementation of word2vec. The model loads fine using:

from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('...../free....-en.bin', binary=True)

and creates a

>>> print model
<gensim.models.word2vec.Word2Vec object at 0x105d87f50>

but when I run the most_similar function, it can't find the words in the vocabulary. The error output is below.

Any ideas where I’m going wrong?

>>> model.most_similar(['girl', 'father'], ['boy'], topn=3)
2013-10-11 10:22:00,562 : WARNING : word 'girl' not in vocabulary; ignoring it
2013-10-11 10:22:00,562 : WARNING : word 'father' not in vocabulary; ignoring it
2013-10-11 10:22:00,563 : WARNING : word 'boy' not in vocabulary; ignoring it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/....../anaconda/python.app/Contents/lib/python2.7/site-packages/gensim-0.8.7/py2.7.egg/gensim/models/word2vec.py", line 312, in most_similar
    raise ValueError("cannot compute similarity with no input")
ValueError: cannot compute similarity with no input

score 7 · Answer 1 · answered Nov 20 '13 at 17:25

The words in '...../free....-en.bin' have the form of

en/boardwalk_chapel en/mutsu_munemitsu en/goffstown en/yaw_axis en/john_e_fogarty_international_center en/francielle_manoel_alberto en/shinji_harada

So when you look up 'girl', it is not in the vocabulary.

score 2 · Answer 2 · answered Jun 10 '15 at 16:16

To expand a bit on Sergio's answer: the "words" are actually Freebase identifiers, so "girl" is represented by either /en/girl (in freebase-vectors-skipgram1000-en.bin.gz) or its MID equivalent /m/05r655 (in freebase-vectors-skipgram1000.bin.gz).

https://www.freebase.com/m/05r655

https://www.freebase.com/en/girl
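
As a rough sketch of what that means in practice, you could map plain words to Freebase-style keys and check them against the vocabulary before querying. The helper names (to_freebase_key, split_known) and the small vocab set below are made up for illustration, and the exact prefix ("/en/" here) is an assumption, so print a few real keys from your model first. In old gensim versions the keys should be available as model.vocab; newer versions expose them differently.

```python
def to_freebase_key(word, prefix="/en/"):
    """Turn a plain word or phrase into a Freebase-style key.

    The default prefix is an assumption; inspect a few keys of the
    loaded model to confirm the exact form before relying on it.
    """
    return prefix + word.strip().lower().replace(" ", "_")


def split_known(words, vocab, prefix="/en/"):
    """Convert words to prefixed keys and split them into found/missing."""
    found, missing = [], []
    for word in words:
        key = to_freebase_key(word, prefix)
        if key in vocab:
            found.append(key)
        else:
            missing.append(key)
    return found, missing


# Hypothetical stand-in for the model's vocabulary keys.
vocab = {"/en/girl", "/en/father", "/en/boy", "/en/yaw_axis"}

positive, missing = split_known(["girl", "father", "queen"], vocab)
print(positive)
print(missing)
```

The keys in positive can then be passed to model.most_similar in place of the bare words, while anything in missing needs a different identifier (or is simply not in the file).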
