1

I am using Python's gensim to train a word2vec model on my 93 million sentences. However, when I save the trained model, I get three output files: the .bin file itself, plus two more with the extensions .bin.trainables.syn1neg.npy and .bin.wv.vectors.npy. I read the answer to "Why are multiple model files created in gensim word2vec?", which explains why this happens. However, I would like to know whether there is a way to combine these files into a single ordinary .bin file.

Arshad Shaik

1 Answer

3

.save() takes an optional parameter, sep_limit, with a default value of 10 MiB, which sets the threshold above which large arrays are written to separate files. You could try setting it to a much larger value, larger than any of the extra files you're seeing, and as long as your model is still small enough not to hit the size limits of Python's pickle, it might work.

However, gensim saves a model to multiple files both for efficiency and to be sure of not hitting size limitations in Python's pickle. If at all possible, you should just keep the files together as a set. They will always share the prefix that you provided as the filename to .save().
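As a rough sketch of the sep_limit suggestion: the helper name `save_as_single_file`, the toy corpus, and the value `2**40` are all made up for illustration, and whether a single huge pickle is practical for a model trained on 93 million sentences is something you would have to test yourself.

```python
import os
import tempfile


def save_as_single_file(model, path, sep_limit=2**40):
    """Save a gensim model with a raised threshold for splitting out arrays.

    The default sep_limit here is an arbitrary, very large value chosen
    for illustration; it is not a recommended setting.
    """
    model.save(path, sep_limit=sep_limit)
    return path


def demo_single_file():
    # Imported lazily because gensim is a third-party dependency.
    from gensim.models import Word2Vec

    # Hypothetical toy corpus; a real one would be far larger.
    sentences = [["hello", "world"], ["hello", "gensim"], ["word", "vectors"]]
    model = Word2Vec(sentences, min_count=1, workers=1)

    out_dir = tempfile.mkdtemp()
    path = save_as_single_file(model, os.path.join(out_dir, "model.bin"))

    # Loading works the same way regardless of how many files were written.
    reloaded = Word2Vec.load(path)
    return sorted(os.listdir(out_dir)), model, reloaded
```

Note that a model this small stays in one file even with the default threshold, so the demo only shows the call pattern; the effect of sep_limit only becomes visible once the arrays are large.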

gojomo
  • 1
    Thanks for your answer. What I noticed is that when I used model.save(), three files were created; however, when I used model.wv.save_word2vec_format(), the issue was resolved. – Arshad Shaik Nov 20 '18 at 05:33
  • 1
    If all you need is the word-vectors, that's a great solution! But note the native `.save()` retains more of the model's info, for example its vocabulary counts and the internal weights that would allow additional training to happen. – gojomo Nov 20 '18 at 20:02
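To illustrate the trade-off discussed in the comments, here is a minimal sketch of the vectors-only export. The helper name `export_vectors_only` and the toy corpus are hypothetical; `save_word2vec_format()` and `KeyedVectors.load_word2vec_format()` are the gensim calls involved.

```python
import os
import tempfile


def export_vectors_only(model, path):
    """Write just the word-vectors in the word2vec binary format (one file)."""
    model.wv.save_word2vec_format(path, binary=True)
    return path


def demo_vectors_only():
    # Imported lazily because gensim is a third-party dependency.
    from gensim.models import KeyedVectors, Word2Vec

    # Hypothetical toy corpus, only here to make the sketch self-contained.
    sentences = [["hello", "world"], ["hello", "gensim"], ["word", "vectors"]]
    model = Word2Vec(sentences, min_count=1, workers=1)

    out_dir = tempfile.mkdtemp()
    path = export_vectors_only(model, os.path.join(out_dir, "vectors.bin"))

    # What comes back is a KeyedVectors object: usable for lookups and
    # similarity queries, but without the weights needed to keep training.
    vectors = KeyedVectors.load_word2vec_format(path, binary=True)
    return model, vectors
```

If you might want to continue training later, keep a native `.save()` copy as well, since the exported file holds only the vectors.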