
While training a d2v (Doc2Vec) model on a large text corpus, I received these 3 files:

doc2vec.model.trainables.syn1neg.npy

doc2vec.model.vocabulary.cum_table.npy

doc2vec.model.wv.vectors.npy

But the final model was not saved, because there was not enough free space available on the disk:

OSError: 5516903000 requested and 4427726816 written

Is there a way to re-save my model using these files, in less time than the full training took?

Thank you in advance!

Dasha

1 Answer


If you still have the model in RAM, in an environment (like a Jupyter notebook) where you can run new code, you could try to clear space (or attach a new volume) and then try a .save() again. That is, you don't need to re-train, just re-save what's already in RAM.
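For example, if the trained model is still bound to a variable (assumed here to be named `model`), the re-save is a one-liner once space is available:

```python
# Assumes the Doc2Vec model from the failed save is still in the variable
# `model`; the target path is a placeholder for wherever space was freed
# up or newly mounted.
model.save('/mnt/bigger_volume/doc2vec.model')
```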

There's no routine for saving "just what isn't already saved". So even though the subfiles that did save could potentially be valuable if you were desperate to salvage anything from the first training run (perhaps via a process like in my Word2Vec answer here, though it's a bit more complicated with Doc2Vec), trying another save to the same place/volume would require getting those existing files out of the way. (Maybe you could transfer them to remote storage in case they'll be needed, but delete them locally to free space?)
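A minimal sketch of moving the partial files aside, assuming they all share the `doc2vec.model` prefix and that `/mnt/backup` is a hypothetical destination with room:

```python
import glob
import os
import shutil

# Hypothetical prefix and destination; adjust to your actual paths.
backup_dir = '/mnt/backup'
for partial in glob.glob('doc2vec.model*'):
    # Keep the subfiles elsewhere in case they prove salvageable, while
    # freeing local disk space for a fresh save.
    shutil.move(partial, os.path.join(backup_dir, os.path.basename(partial)))
```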

If you try to save to a filename that ends ".gz", gensim will try to save everything compressed, which might help a little. (Unfortunately, the main vector arrays don't compress very well, so this might not be enough savings alone.)
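That is, something like:

```python
# The '.gz' suffix makes gensim compress what it writes; the filename is
# just an example.
model.save('doc2vec.model.gz')
```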

There's no easy way to slim an already-trained model in memory, without potentially destroying some of its capabilities. (There are hard ways, but only if you're sure you can discard things a full model could do... and it's not yet clear you're in that situation.)

The major contributors to model size are the number of unique words and the number of unique doc-tags.

Specifying a larger min_count before training will discard more low-frequency words – and very-low-frequency words often just hurt the model anyway, so this trimming often improves three things simultaneously: faster training, smaller model, and higher-quality-results on downstream tasks.
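For example (a sketch with placeholder values; `tagged_corpus` stands in for your iterable of `TaggedDocument` objects, and this must be chosen before training):

```python
from gensim.models.doc2vec import Doc2Vec

# Raising min_count above gensim's default of 5 discards more
# low-frequency words, shrinking both the vocabulary and the saved model.
model = Doc2Vec(tagged_corpus, vector_size=100, min_count=25, epochs=10)
```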

If you're using plain-int doc-tags, the model will require vector space for all doc-tag ints from 0 to your highest number. So even if you trained just 2 documents, if they had plain-int doc-tags of 999998 and 999999, it'd still need to allocate (and save) garbage vectors for 1 million tags, 0 to 999,999. So in some cases people's memory/disk usage is higher than expected because of that – and either using contiguous IDs starting from 0, or switching to string-based doc-tags, reduces size a lot. (But, again, this has to be chosen before training.)
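A toy illustration of the difference (placeholder tokens, not the asker's data):

```python
from gensim.models.doc2vec import TaggedDocument

# Plain-int tags this high force allocation (and saving) of ~1 million
# doc-vectors, even though only 2 documents exist:
wasteful = [
    TaggedDocument(words=['some', 'tokens'], tags=[999998]),
    TaggedDocument(words=['more', 'tokens'], tags=[999999]),
]

# Contiguous ints starting from 0 keep the doc-vector array at 2 entries;
# string tags (e.g. tags=['doc_a']) avoid the issue entirely:
compact = [
    TaggedDocument(words=['some', 'tokens'], tags=[0]),
    TaggedDocument(words=['more', 'tokens'], tags=[1]),
]
```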

gojomo
  • Unfortunately, I don't have my model in RAM anymore. As for re-saving the model using that tutorial for word2vec, I'll try it, thank you. Are there any major differences I should know about? – Dasha Jul 18 '18 at 07:43
  • There should be extra docvecs-specific files in the `Doc2Vec` case that you'd have to account for. Of the 3 files you listed as having been saved, none is the file with the trained doc-vectors, so if you need those, you'll have to re-train – there's no hope of recovering them from the 3 files you've mentioned. – gojomo Jul 18 '18 at 19:15
  • You could experiment with training a smaller test Doc2Vec model: save it, and look at all its files, including those analogous to the 3 you have. (If they're not visible, you'd have to try a larger model, because subfiles are only used when the arrays are over a certain size.) Try deleting all but those 3, and see if you can reconstruct a usable model. If it works on the smaller test model, repeat on the larger one. (A sketch of this experiment follows these comments.) – gojomo Jul 18 '18 at 19:17
  • But really, the only thing sure to work is re-training and making sure you have enough space for the results to be saved. Doing it in a notebook would give you a better chance of recovering from a mid-process error, because the interim results would still be in RAM, so you could (for example) clear disk space and then try another `save()`. – gojomo Jul 18 '18 at 19:17
  • Thank you for your answers, they really helped me stop wasting my time trying to recover this model. Very interesting tip about doing it in a notebook, I'll try it, thanks! – Dasha Jul 19 '18 at 08:30
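A hypothetical version of the experiment sketched in the comments above (corpus, sizes, and filenames are all placeholders; note the separate array subfiles only appear once arrays exceed gensim's size threshold, so a tiny model may save as a single file):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Train and save a small throwaway model just to see which files gensim
# writes to disk (the corpus here is synthetic filler).
docs = [TaggedDocument(words=['token%d' % i, 'filler'], tags=[i])
        for i in range(1000)]
test_model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=5)
test_model.save('test.model')

# Now inspect the test.model* files on disk. To rehearse the recovery,
# delete all but the analogues of your 3 surviving files, then attempt a
# load; expect it to fail until you patch in replacements along the lines
# of the linked Word2Vec answer.
reloaded = Doc2Vec.load('test.model')
```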