'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

Question

I want to use Word2Vec, and I have downloaded a Word2Vec corpus for the Indonesian language, but when I try to load it, I get an error. This is what I tried:

Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)

and it gives me an error like this:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-73-219e152ee7d9> in <module>()
----> 1 Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)

2 frames
/usr/local/lib/python3.7/dist-packages/gensim/utils.py in any2unicode(text, encoding, errors)
    353     if isinstance(text, unicode):
    354         return text
--> 355     return unicode(text, encoding, errors=errors)
    356 
    357 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

score 0 · Accepted Answer · answered Mar 23 '22 at 01:57


A file named idwiki_word2vec_100_new_lower.model.wv.vectors.npy is unlikely to be in the format needed by load_word2vec_format().

The .npy suggests it is a raw numpy array, which is not the format expected.
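The byte in the error message is consistent with this: the .npy format starts with the magic string `\x93NUMPY`, and 0x93 cannot be the first byte of a UTF-8 sequence, so any attempt to read the start of such a file as text fails at position 0. A minimal sketch using only the standard library; the helper names `looks_like_npy` and `first_byte_utf8_error` are made up for illustration, and the temporary file is only a stand-in for the real download:

```python
import os
import tempfile

# Every .npy file begins with these six bytes (the NumPy "magic string").
NPY_MAGIC = b"\x93NUMPY"


def looks_like_npy(path):
    """Return True if the file at `path` starts with the .npy magic string."""
    with open(path, "rb") as f:
        return f.read(len(NPY_MAGIC)) == NPY_MAGIC


def first_byte_utf8_error(path):
    """Try to decode the first byte as UTF-8; return the error text, or None."""
    with open(path, "rb") as f:
        head = f.read(1)
    try:
        head.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Stand-in file: only the magic string plus two version bytes,
# not a complete or valid array.
with tempfile.NamedTemporaryFile(suffix=".npy", delete=False) as tmp:
    tmp.write(NPY_MAGIC + b"\x01\x00")
    demo_path = tmp.name

print(looks_like_npy(demo_path))
print(first_byte_utf8_error(demo_path))
os.remove(demo_path)
```

Running `looks_like_npy` on the downloaded file is a quick way to confirm that it is a NumPy array file rather than something `load_word2vec_format()` can read.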

Also, the .wv.vectors. section suggests this could be part of a full, multi-file Gensim .save() of a complete Word2Vec model. That's more than just the vectors, & requires all associated files to re-load.

You should double-check the source of the vectors and what their claims are about its format and the proper ways to load. (If you're still having problems & need more guidance, you should specify more details about the origin of the file – for example a link to the website where it was obtained – to support other suggestions.)

answered Mar 23 '22 at 01:57

gojomo


I am using `idwiki_word2vec_100_new_lower.model.wv.vectors.npy` from [link](https://drive.google.com/file/d/1WFBBCDIssfHDeFpYgWPtj14e71vYkX6o/view) or [link](https://medium.com/@diekanugraha/membuat-model-word2vec-bahasa-indonesia-dari-wikipedia-menggunakan-gensim-e5745b98714d). I think it is a Word2Vec corpus for the Indonesian language, so I used it. – Ronald Ferdinand Mar 23 '22 at 02:05
Is this file not meant to be used with `gensim.models.KeyedVectors.load_word2vec_format`? If not, I think I have downloaded the wrong file. – Ronald Ferdinand Mar 23 '22 at 02:09
That Medium article shows an entire `Word2Vec` model being saved with the code `id_w2v.save('model/idwiki_word2vec_200_new_lower.model')`. That means you'd reload the entire model (which is spread across multiple files, and includes more than just the word-vectors) with code like `model = Word2Vec.load('idwiki_word2vec_200_new_lower.model')`. Nothing in the article or filename implies it is in the format that uses `load_word2vec_format()`. After the load, you will have a whole model. If you only need the word-vectors, as a `KeyedVectors` object, they'll be inside the `model.wv` variable. – gojomo Mar 23 '22 at 06:56
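
A minimal sketch of the approach in that last comment, assuming the main `.model` file and all of its `.npy` side files sit in the same folder, and assuming the side file's name is the main file's name plus `.wv.vectors.npy` (as the file name in the question suggests). `base_model_path` and `load_keyed_vectors` are hypothetical helper names, not part of Gensim:

```python
# Suffix seen on the file in the question; assumed to mark a side file
# written next to the main model file by a full-model save.
SIDE_FILE_SUFFIX = ".wv.vectors.npy"


def base_model_path(path):
    """Strip the side-file suffix, if present, to get the main model path."""
    path = str(path)
    if path.endswith(SIDE_FILE_SUFFIX):
        return path[: -len(SIDE_FILE_SUFFIX)]
    return path


def load_keyed_vectors(path):
    """Load the full Word2Vec model and return only its word vectors."""
    # Imported here so base_model_path() can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec.load(base_model_path(path))
    return model.wv  # the KeyedVectors object holding the word vectors


print(base_model_path("idwiki_word2vec_100_new_lower.model.wv.vectors.npy"))

# Example call (needs gensim and the complete set of saved files):
# kv = load_keyed_vectors("idwiki_word2vec_100_new_lower.model.wv.vectors.npy")
# print(kv.vector_size)
```

If the main `.model` file was not downloaded along with the `.npy` file, the load will fail, which is why all associated files need to be kept together.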
