
Hello, I have some word2vec models generated with the Java Word2Vec implementation in DL4J and saved by calling

writeWord2VecModel(Word2Vec vectors, String path)

The output is a zip file that contains a number of .txt files. I can successfully load and use the model in DL4J using

Word2Vec readWord2VecModel(String path)

I am now trying to read that model in Python, using gensim:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('file_path', binary=False)

But I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 10: invalid continuation byte

I also tried with binary=True and got the same error.

If I extract the model generated by DL4J I get the following files:

List Of Files

Is there a way to read that model in Python with gensim?

Y2theZ

1 Answer


None of the filenames shown in your image are of types gensim can read as word-vectors.

What file path and filename are you supplying to load_word2vec_format()? (None of gensim's load-methods can take a .zip archive.)
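To confirm what you are actually handing to gensim, a quick check with Python's standard zipfile module can help. This is only a minimal sketch: the function name describe_model_file is made up for illustration, and the path you pass in is whatever you currently give to load_word2vec_format().

```python
import zipfile


def describe_model_file(path):
    """Report whether `path` is a ZIP archive and, if so, which files it holds.

    `describe_model_file` is a hypothetical helper name, not part of gensim or DL4J.
    """
    if not zipfile.is_zipfile(path):
        # Not an archive: it may or may not be a single word-vector file.
        return {"is_zip": False, "members": []}
    with zipfile.ZipFile(path) as archive:
        # namelist() returns the stored file names in archive order.
        return {"is_zip": True, "members": archive.namelist()}
```

If is_zip comes back True, the file is the full DL4J model archive rather than a single word-vector file, which would be consistent with the decode error you are seeing.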

There might be another way to export the vectors from DL4J, into word2vec.c-format (text or binary, single file), rather than a full model ZIP archive.

If you succeed in that, try supplying such a single file to load_word2vec_format(), with the appropriate binary value.

(If at that point you have a correctly formatted file but still get Unicode errors, perhaps later in the file, load_word2vec_format() accepts an optional unicode_errors='ignore' argument that skips over undecodable bytes. But I don't think that's your main problem, nor would it be a problem if DL4J could export word-vectors the right way.)
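As a rough illustration of the last two points, the sketch below first checks that a file looks like the word2vec.c text layout (a header line with the vocabulary size and vector size, then one "word v1 ... vN" row per word) and only then passes it to gensim. Both function names are hypothetical helpers, the check only samples the first few rows, and it assumes a single-file text export already exists.

```python
def looks_like_word2vec_text(path, max_rows=5):
    """Rough check for the word2vec.c text layout (hypothetical helper).

    Expects a header 'vocab_size vector_size', then rows of a word
    followed by exactly vector_size floats. Only the first rows are sampled.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        header = handle.readline().split()
        if len(header) != 2 or not all(part.isdigit() for part in header):
            return False
        vector_size = int(header[1])
        for _ in range(max_rows):
            line = handle.readline()
            if not line:
                break  # fewer rows than max_rows; nothing more to check
            fields = line.rstrip().split(" ")
            if len(fields) != vector_size + 1:
                return False
            try:
                [float(value) for value in fields[1:]]
            except ValueError:
                return False
    return True


def load_text_vectors(path):
    """Load a single-file text export with gensim (hypothetical wrapper)."""
    # Imported here so the format check above works without gensim installed.
    from gensim.models import KeyedVectors

    if not looks_like_word2vec_text(path):
        raise ValueError(f"{path} does not look like word2vec text format")
    # unicode_errors='ignore' drops undecodable bytes instead of raising.
    return KeyedVectors.load_word2vec_format(
        path, binary=False, unicode_errors="ignore"
    )
```

If the check fails on every file inside the extracted archive, that suggests the vectors need to be exported from DL4J in a different format before gensim can read them.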

gojomo