Building dictionary with GoogleNews-vectors-negative300.bin returns ValueError: could not convert string to float

Question

When I try to load GoogleNews-vectors-negative300.bin with torchtext's Vectors class, I get:

ValueError: could not convert string to float: b'\x00\x00\x94:\x00\x00k\xba\x00\x00\x

I have tried the approach from this post (@robodasha), but without success. My goal is to build a vocabulary from the loaded embeddings using build_vocab. Any suggestions?

Darkmoor · Answer 1 · 2019-10-04T16:26:24.193

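I finally solved this by loading the file with gensim, which can read the word2vec binary format (binary=True), and then handing the vectors to the torchtext vocabulary: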

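import torch
from gensim.models import KeyedVectors
from torchtext import data  # legacy Field API (moved to torchtext.legacy in torchtext 0.9)

# emb_bin_filename, my_tokenizer, lower, train_data and device are defined elsewhere.
# Read the binary word2vec file with gensim.
emb_model = KeyedVectors.load_word2vec_format(emb_bin_filename, binary=True, encoding="ISO-8859-1", unicode_errors='ignore')
# Map each token to its row in emb_model.vectors (index2word is called index_to_key in gensim >= 4.0).
word2index = {token: token_index for token_index, token in enumerate(emb_model.index2word)}
TEXT = data.Field(tokenize=my_tokenizer(), lower=lower)
TEXT.build_vocab(train_data)
# Copy the pretrained rows into the vocabulary; tokens missing from word2index get zero vectors by default.
TEXT.vocab.set_vectors(word2index, torch.from_numpy(emb_model.vectors).float().to(device), emb_model.vector_size)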
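
For context, here is a rough sketch of why the original error appears and why binary=True matters. It assumes the usual word2vec binary layout: a text header line "vocab_size dim", then for each word the token, a space, and dim raw float32 values. The helper names and the toy vectors are made up for illustration; this is not gensim's or torchtext's actual implementation.

```python
import io
import struct


def write_word2vec_binary(vectors):
    """Serialize {word: [floats]} in the (assumed) word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    buf = io.BytesIO()
    buf.write(f"{len(vectors)} {dim}\n".encode("utf-8"))  # plain-text header
    for word, values in vectors.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dim}f", *values))  # raw little-endian float32
        buf.write(b"\n")
    return buf.getvalue()


def read_word2vec_binary(blob):
    """Parse the layout written above back into {word: [floats]}."""
    stream = io.BytesIO(blob)
    vocab_size, dim = map(int, stream.readline().split())
    result = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left over from the previous record
                word.extend(ch)
        raw = stream.read(4 * dim)
        result[word.decode("utf-8")] = list(struct.unpack(f"<{dim}f", raw))
    return result


# Toy data (hypothetical values, chosen to be exactly representable as float32).
toy = {"king": [0.5, -0.25], "queen": [1.0, 2.0]}
blob = write_word2vec_binary(toy)

# A binary-aware reader recovers the vectors.
parsed = read_word2vec_binary(blob)

# A text-style reader that calls float() on the raw bytes fails,
# with the same kind of message as in the question.
first_record = blob.split(b"\n", 1)[1]
_, payload = first_record.split(b" ", 1)
try:
    float(payload[:4])
    text_error = None
except ValueError as exc:
    text_error = str(exc)

print(parsed)
print(text_error)
```

The point of the sketch is only that the vector payload is packed bytes rather than decimal text, so any loader that splits on spaces and calls float() will raise this ValueError, whereas a loader told to expect the binary format (such as gensim with binary=True) will not.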
