Importing GloVe vectors into gensim. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte

Question

I've produced GloVe vectors using the code provided by https://github.com/stanfordnlp/GloVe/blob/master/demo.sh using my own corpus. So, I have both the .bin file and .txt file vectors. I'm trying to import these files into gensim so I can work with them like I can word2vec vectors.

I've tried loading both the binary file and the text file, but only ended up getting a pickling error:

models = gensim.models.Word2Vec.load(file)

I've also tried ignoring the Unicode error, which didn't work; I still got the same error.

model = gensim.models.KeyedVectors.load_word2vec_format(file, binary=True, unicode_errors='ignore')

This is what I have for my code right now:

from gensim.models import KeyedVectors
import gensim
from gensim.models import word2vec

file = 'vectors.bin'
model = KeyedVectors.load_word2vec_format(file, binary=True, unicode_errors='ignore')  
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

This is the error message I keep getting:

Traceback (most recent call last):
  File "glove_to_word2vec.py", line 6, in <module>
    model = KeyedVectors.load_word2vec_format(file, binary=True)  # C  binary format
  File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1498, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/models/utils_any2vec.py", line 343, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte

The pickling error was something like this: Unpickling Error while using Word2Vec.load()

Text file format

See https://stackoverflow.com/a/55709195/251674 – Manuel Alves, Jul 22 '21 at 09:53

score 0 · Answer 1 · answered Oct 29 '19 at 19:12

There's no expectation a plain .load() would work – that will only work with gensim's own models, saved with the matching .save() method.

However, .load_word2vec_format() should work with files in the right format.

Are you sure the file is in a compatible format? (Does it load into the original Google word2vec.c sibling tools, like the distance or word-analogy executables?)

You mentioned having the .txt format as well – have you tried loading that file (with binary=False)?

Looking at line 343 of utils_any2vec.py (in a version of gensim you're likely using), that appears to be reading the very 1st line of the file, which should only have 2 plain space-separated numbers on it: the number of words, and the number of dimensions. (That is, encoding issues with regard to your actual word-tokens shouldn't even be involved.)

If you look at your file with head -1 vectors.txt, is that all you see? (If not, your GloVe code isn't writing the right compatible format.)
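For anyone who prefers to run that first-line check from Python rather than with head, here is a minimal sketch. The function names are hypothetical (they are not part of gensim), and it assumes the header, when present, is exactly two whitespace-separated integers as described above.

```python
import re


def first_line_is_word2vec_header(first_line):
    """Return True if the line is just '<number_of_words> <dimensions>'."""
    # Hypothetical helper: two integers separated by whitespace, nothing else.
    return re.fullmatch(r"\d+\s+\d+", first_line.strip()) is not None


def file_has_word2vec_header(path, encoding="utf-8"):
    """Read only the first line of the file and test it for the header."""
    # errors="replace" means a binary file simply fails the check
    # instead of raising UnicodeDecodeError.
    with open(path, "r", encoding=encoding, errors="replace") as fin:
        return first_line_is_word2vec_header(fin.readline())
```

If file_has_word2vec_header('vectors.txt') comes back False, the file starts directly with a word and its vector, so load_word2vec_format() has no header to parse.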

I've included the format of the text file. When I try loading it with that file I get the error ValueError: invalid literal for int() with base 10: "'kenya',". — epair, Oct 30 '19 at 17:08
Yep, it's missing the required, typical declaration of the number of vectors & size of each vector. If your GloVe code can't be fixed/tweaked to export in the right format, you could conceivably hand-patch the `.txt` file to make it work. If there were 50,000 words of 300 dimensions each, you'd want to edit the file to prepend a line `50000 300\n` (where `\n` is a newline). But also: as long as you're using `gensim` for later operations, you may want to train your word-vectors from your text there, too. It might be faster, & would offer more tweakable options, & alternative algorithms like FastText. — gojomo, Oct 31 '19 at 00:13
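The hand-patching suggested in the comment above could be scripted roughly as follows. This is only a sketch under stated assumptions: add_word2vec_header is a hypothetical helper (not a gensim function), and it assumes every non-empty row of the GloVe text file is a token followed by space-separated floats, with no spaces inside tokens. It counts the rows, takes the dimensionality from the first row, and writes a copy with the header line prepended.

```python
def add_word2vec_header(glove_path, out_path, encoding="utf-8"):
    """Copy a headerless vector text file, prepending '<count> <dims>'."""
    # First pass: count the vectors and read the size from the first row.
    num_vectors = 0
    num_dims = None
    with open(glove_path, "r", encoding=encoding) as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            if num_dims is None:
                # Assumption: first field is the token, the rest are floats.
                num_dims = len(line.split()) - 1
            num_vectors += 1
    if num_dims is None:
        raise ValueError("no vectors found in %s" % glove_path)

    # Second pass: write the header, then copy the rows unchanged.
    with open(glove_path, "r", encoding=encoding) as fin, \
            open(out_path, "w", encoding=encoding) as fout:
        fout.write("%d %d\n" % (num_vectors, num_dims))
        for line in fin:
            if line.strip():
                fout.write(line if line.endswith("\n") else line + "\n")
    return num_vectors, num_dims
```

The patched copy can then be tried with KeyedVectors.load_word2vec_format(out_path, binary=False). Depending on the gensim version, there may be no need to patch the file at all: 3.x releases ship a glove2word2vec conversion script in gensim.scripts, and 4.x releases accept no_header=True in load_word2vec_format(). It is worth checking the documentation for the installed version before relying on either.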
