How to visualize Gensim Word2vec Embeddings in Tensorboard Projector

Question

Following gensim word2vec embedding tutorial, I have trained a simple word2vec model:

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.save("/content/word2vec.model")

I would like to visualize it using the Embedding Projector in TensorBoard. The gensim documentation has another straightforward tutorial for this, which I followed in Colab:

!python3 -m gensim.scripts.word2vec2tensor -i /content/word2vec.model -o /content/my_model

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 94, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 68, in word2vec2tensor
    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py", line 1438, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py", line 172, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/usr/local/lib/python3.7/dist-packages/gensim/utils.py", line 355, in any2unicode
    return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Please note that I first checked this exact same question from 2018, but the accepted answer no longer works because both gensim and tensorflow have been updated, so I considered it worth asking again in Q4 2021.

Can you be more specific about how the old info "no longer works"? (Does it hit specific errors? Give results that look wrong? etc.) If you show any specific error in your question, there may be trivial code updates that can resolve it, for either package, such as the various tips given in the Gensim 4 migration guide: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 – gojomo, Sep 19 '21 at 18:01
Could you please refer to this [doc](https://notebook.community/mattilyra/gensim/docs/notebooks/Tensorboard_visualizations)? Hope it helps. Thanks. – Sep 22 '21 at 13:36

user1635327 · Accepted Answer · 2021-09-27T18:40:39.277


Saving the vectors in the format of the original C word2vec implementation, via model.wv.save_word2vec_format("/content/word2vec.model"), resolves the issue:

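from gensim.test.utils import common_texts
from gensim.models import Word2Vec
# Train as in the question (note: in Gensim 4.x the `size` parameter is named `vector_size`)
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
# Save only the word vectors, in the original word2vec format, instead of calling model.save()
model.wv.save_word2vec_format("/content/word2vec.model")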
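
With the vectors saved in this format, the word2vec2tensor command from the question can read the file. Per the script's documentation it writes two files, <prefix>_tensor.tsv and <prefix>_metadata.tsv, which can then be loaded into the Embedding Projector. As a rough sketch of what those files contain (not the script's actual code), the following writes the same two-file layout using only the standard library. The names toy_vectors and write_projector_files are hypothetical, made up for illustration.

```python
import os
import tempfile

# Hypothetical toy vectors standing in for a trained model's word vectors.
toy_vectors = {
    "human": [0.1, 0.2, 0.3],
    "computer": [0.4, 0.5, 0.6],
}


def write_projector_files(vectors, prefix):
    """Write <prefix>_tensor.tsv and <prefix>_metadata.tsv.

    Each tensor row holds one tab-separated vector; the metadata row at the
    same position holds the word that vector belongs to.
    """
    tensor_path = prefix + "_tensor.tsv"
    metadata_path = prefix + "_metadata.tsv"
    with open(tensor_path, "w", encoding="utf-8") as tensor_file, \
            open(metadata_path, "w", encoding="utf-8") as metadata_file:
        for word, vector in vectors.items():
            tensor_file.write("\t".join(str(x) for x in vector) + "\n")
            metadata_file.write(word + "\n")
    return tensor_path, metadata_path


# Write into a temporary directory so the sketch has no side effects elsewhere.
out_dir = tempfile.mkdtemp()
tensor_path, metadata_path = write_projector_files(
    toy_vectors, os.path.join(out_dir, "my_model")
)
```

The row order matters: line N of the metadata file labels line N of the tensor file, which is why both are written in the same loop.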

Gensim has two ways of storing word2vec models: the keyed-vector format from the original word2vec implementation, and Gensim's own format, which additionally stores hidden weights, vocabulary frequencies, and more. Examples and details can be found in the documentation. The script word2vec2tensor.py expects the original format and loads the model with load_word2vec_format, as the traceback in the question shows, so it cannot read a file written by model.save().
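
The byte 0x80 in the error message fits this explanation. As far as I know, model.save() writes a Python pickle, and pickle streams (protocol 2 and later) begin with the byte 0x80, whereas the word2vec text format begins with a plain-text header line giving the vocabulary size and vector size. Here is a minimal stdlib-only sketch of that difference. The dict is a hypothetical stand-in for a model, and the header values are example numbers only.

```python
import pickle

# Hypothetical stand-in for a saved model object (a real Gensim model is far larger).
fake_model = {"vectors": [0.1, 0.2, 0.3]}

# Pickle protocol 2 and later start the stream with the PROTO opcode, byte 0x80.
payload = pickle.dumps(fake_model, protocol=2)
first_byte = payload[0]

# Reading that stream as UTF-8 text fails at position 0, as in the traceback.
try:
    payload.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The word2vec text format instead starts with a header line:
# "<vocab_size> <vector_size>" (example values, not taken from a real file).
text_header = "12 100\n".encode("utf-8")
vocab_size, vector_size = (int(x) for x in text_header.decode("utf-8").split())

print(hex(first_byte), decode_error)
print(vocab_size, vector_size)
```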

edited Sep 27 '21 at 18:40

answered Sep 26 '21 at 11:27


Can you provide an end-to-end runnable answer including a brief explanation of the issue? – G. Macia Sep 26 '21 at 20:02
I've added the details. – user1635327 Sep 27 '21 at 19:11
@user1635327 is there any way to apply this in FastText model? – John Angelopoulos Jul 29 '22 at 08:57
