When I try to load GoogleNews-vectors-negative300.bin with the PyTorch (torchtext) Vectors class, I get:

ValueError: could not convert string to float: b'\x00\x00\x94:\x00\x00k\xba\x00\x00\x

I have tried the approach from this post (@robodasha), but without success. My goal is to build a vocabulary from the loaded embeddings using build_vocab. Any suggestions?

Darkmoor

1 Answer

I finally solved this by loading the binary file with gensim and then passing the vectors to the torchtext vocabulary:

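import torch
from gensim.models import KeyedVectors
from torchtext import data

# Load the binary word2vec file with gensim rather than torchtext
emb_model = KeyedVectors.load_word2vec_format(emb_bin_filename, binary=True, encoding="ISO-8859-1", unicode_errors='ignore')
# Map each token to its row in the embedding matrix (index2word is called index_to_key in gensim >= 4.0)
word2index = {token: token_index for token_index, token in enumerate(emb_model.index2word)}
TEXT = data.Field(tokenize=my_tokenizer(), lower=lower)
TEXT.build_vocab(train_data)
# Copy the pretrained vectors into the vocabulary built from the training data
TEXT.vocab.set_vectors(word2index, torch.from_numpy(emb_model.vectors).float().to(device), emb_model.vector_size)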
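
As for the original error, my reading (an assumption, I have not traced it through the torchtext source) is that the loader was treating the .bin file as a text file and calling float() on raw bytes. The bytes quoted in the error message look like packed float32 values, which is what the binary word2vec format stores. A minimal sketch using only the standard library; the helper names parse_as_text and parse_as_float32 are mine, for illustration only:

```python
import struct

# First four bytes quoted in the error message from the question.
raw = b"\x00\x00\x94:"


def parse_as_text(chunk: bytes) -> float:
    """Treat the bytes as a decimal number written out as text."""
    return float(chunk)


def parse_as_float32(chunk: bytes) -> float:
    """Interpret the same four bytes as one little-endian float32."""
    (value,) = struct.unpack("<f", chunk)
    return value


# Parsing as text fails in the same way as the error in the question.
try:
    parse_as_text(raw)
    text_error = None
except ValueError as exc:
    text_error = str(exc)

# Parsing as binary yields a small number, typical of an embedding weight.
binary_value = parse_as_float32(raw)

print(text_error)
print(binary_value)
```

This is why passing binary=True to gensim matters: it tells the loader to unpack the vectors as binary floats instead of splitting the file on whitespace and parsing text.
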
Darkmoor