I am using Python's gensim to train word2vec on my 93 million sentences. However, when I train my model, I get three files as output: the .bin file itself, plus two more with the extensions .bin.trainables.syn1neg.npy and .bin.wv.vectors.npy. I went through the answer provided here: Why are multiple model files created in gensim word2vec?, which explains why this happens. However, I would like to know whether there is a way to convert these files into a normal single bin file.
1 Answer
There is an optional parameter to `.save()`, called `sep_limit`, with a default value of 10MiB, which controls the threshold over which separate files are used. You could try setting this to a much larger value – larger than any of the extra files you're seeing – and as long as your model is still small enough to not hit `pickle` limits, it might work.
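As a rough sketch of that suggestion, assuming `model` is an already-trained gensim `Word2Vec` object: the helper name `save_as_single_file` and the 2 GiB threshold below are illustrative choices, not gensim defaults. `sep_limit` is given in bytes.

```python
# Illustrative threshold (an assumption, not a gensim default): arrays smaller
# than this many bytes stay inside the main pickle instead of separate .npy files.
LARGE_SEP_LIMIT = 2 * 1024**3


def save_as_single_file(model, path, sep_limit=LARGE_SEP_LIMIT):
    """Save a gensim model with a raised threshold for separate .npy files.

    `model` is expected to expose gensim's `.save(fname, sep_limit=...)`.
    Whether a single file actually results still depends on the model
    fitting within pickle's size limits.
    """
    model.save(path, sep_limit=sep_limit)
    return path
```

Calling `save_as_single_file(model, "word2vec.bin")` would then be worth trying; if the arrays are larger than the chosen threshold, gensim will still write them separately.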
But `gensim` saves a model to multiple files both for efficiency and to be sure of not hitting size limitations in Python's `pickle`. You should, if at all possible, just keep the files together as a set. They will always share the same prefix: the name you provided to `.save()`.
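If the goal is simply to move or copy the model without losing a piece, the shared prefix makes the set easy to collect. The following is a small standard-library sketch; the helper name `model_file_set` is hypothetical and not part of gensim.

```python
import glob
from pathlib import Path


def model_file_set(model_path):
    """Return the main model file followed by every file named '<model file>.<something>'."""
    main = Path(model_path)
    # Extra files look like "<name>.wv.vectors.npy" or "<name>.trainables.syn1neg.npy".
    pattern = glob.escape(main.name) + ".*"
    extras = sorted(p for p in main.parent.glob(pattern) if p.is_file())
    return [main] + extras
```

Passing the path originally given to `.save()` returns the list of paths to copy or archive together, for example with `shutil.copy2`.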

gojomo
- Thanks for your answer. What I noticed is that when I used `model.save()`, three files were created; however, when I used `model.wv.save_word2vec_format()`, this issue was resolved. – Arshad Shaik Nov 20 '18 at 05:33
- If all you need is the word-vectors, that's a great solution! But note the native `.save()` retains more of the model's info, for example its vocabulary counts and the internal weights that would allow additional training to happen. – gojomo Nov 20 '18 at 20:02
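For reference, the vectors-only export mentioned in these comments might look like the sketch below. It assumes `model` is a trained gensim `Word2Vec` object; the wrapper name `export_word_vectors` is illustrative only.

```python
def export_word_vectors(model, path, binary=True):
    """Write only the word vectors, in the original word2vec format, to one file.

    Unlike the native `.save()`, this drops the training state, so the
    result cannot be used to continue training.
    """
    model.wv.save_word2vec_format(path, binary=binary)
    return path
```

A file written this way can be read back with `gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)`, which yields the vectors but not a trainable model.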