There's no facility in gensim's `Word2Vec` to accept a negative value for `workers`. (Where'd you get the idea that would be meaningful?) So, it's quite possible that's breaking something, perhaps preventing any training from even being attempted.
Was there sensible logging output (at level `INFO`) suggesting that training was progressing in your trial runs, either against the `PathLineSentences` or your second attempt? Did utilities like `top` show busy threads? Did the output suggest a particular rate of progress & let you project out a likely finishing time?
I'd suggest using a positive `workers` value and watching `INFO`-level logging to get a better idea of what's happening.
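
gensim reports its progress through Python's standard `logging` module, so a minimal setup looks something like this:

```python
import logging

# gensim logs its vocabulary-scan and training progress through the
# standard logging module; INFO level is enough to see it.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
```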
Unfortunately, even with 36 cores, using a corpus iterable sequence (like `PathLineSentences`) puts gensim `Word2Vec` in a mode where you'll likely get maximum throughput with a `workers` value in the 8-16 range, using far fewer than all your threads. But it will do the right thing, on a corpus of any size, even if it's being assembled by the iterable sequence on-the-fly.
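
As a rough sketch of that mode, assuming a directory of text files with one whitespace-tokenized sentence per line (the path here is a placeholder):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# Streams sentences from every file in the directory, one pass per epoch.
corpus = PathLineSentences('/path/to/corpus_dir')  # placeholder path

# Even on a 36-core machine, iterable-corpus throughput usually peaks
# with a workers value somewhere in the 8-16 range.
model = Word2Vec(corpus, workers=12)
```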
Using the `corpus_file` mode can saturate far more cores, but you should still specify the actual number of worker threads to use – in your case, `workers=36` – and it is designed to work from a single file containing all the data.
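
Sketched, again with a placeholder path to that single pre-tokenized file:

```python
from gensim.models import Word2Vec

# corpus_file mode reads one big space-delimited, line-per-sentence file
# and can keep many more worker threads busy than the iterable mode.
model = Word2Vec(
    corpus_file='/path/to/all_data.txt',  # placeholder: one file, all data
    workers=36,
)
```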
Your code, which attempts to call `train()` many times with `corpus_file`, has lots of problems, and I can't think of a way to adapt `corpus_file` mode to work on your many files. Some of the problems include:
- you're only building the vocabulary from the 1st file, which means any words only appearing in other files will be unknown and ignored, and any of the word-frequency-driven parts of the `Word2Vec` algorithm may be working on unrepresentative statistics
- the model builds its estimate of the expected corpus size (eg: `model.corpus_total_words`) from the `build_vocab()` step, so every `train()` will behave as if that size is the total corpus size, in its progress-reporting & management of the internal `alpha` learning-rate decay. So those logs will be wrong, and the `alpha` will be mismanaged in a fresh decay each `train()`, resulting in a nonsensical sawtooth up-and-down `alpha` over all files (see the sketch after this list)
- you're only iterating over each file's contents once, which isn't typical. (It might be reasonable in a giant 210-billion word corpus, though, if every file's text is equally and randomly representative of the domain. In that case, the full corpus once might be as good as iterating over a corpus that's 1/5th the size 5 times. But it'd be a problem if some words/patterns-of-usage are all clumped in certain files - the best training interleaves contrasting examples throughout each epoch and all epochs.)
- `min_count=1` is almost always unwise with this algorithm, and especially so in large corpora of typical natural-language word frequencies. Rare words, and especially those appearing only once or a few times, make the model gigantic, but those words won't get good word-vectors, and keeping them in acts like noise interfering with the improvement of other more-common words.
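
To make those issues concrete, here's a hedged sketch of the kind of per-file loop being described – not something to copy – with comments on where it goes wrong (file names are hypothetical):

```python
from gensim.models import Word2Vec

files = ['part-0001.txt', 'part-0002.txt']  # hypothetical per-part files

model = Word2Vec(min_count=1, workers=36)
model.build_vocab(corpus_file=files[0])  # vocab & corpus-size estimate from the 1st file only

for f in files:
    # Each call treats one file as if it were the whole corpus: progress
    # logs are scaled to the 1st file's word count, and the internal alpha
    # learning rate decays from its start value to its minimum again on
    # every call - a fresh sawtooth per file.
    model.train(
        corpus_file=f,
        total_words=model.corpus_total_words,
        epochs=model.epochs,
    )
```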
I recommend:
- Try the corpus iterable sequence mode, with logging and a sensible `workers` value, to at least get an accurate read of how long it might take. (The longest step will be the initial vocabulary scan, which is essentially single-threaded and must visit all data. But you can `.save()` the model after that step, to then later re-`.load()` it, tinker with settings, and try different `train()` approaches without repeating the slow vocabulary survey. See the first sketch after this list.)
- Try aggressively-higher values of `min_count` (discarding more rare words for a smaller model & faster training). Perhaps also try aggressively-smaller values of `sample` (like `1e-05`, `1e-06`, etc) to discard a larger fraction of the most-frequent words, for faster training that also often improves overall word-vector quality (by spending relatively more effort on less-frequent words). See the second sketch below.
- If it's still too slow, consider whether a smaller subsample of your corpus might be enough.
- Consider the `corpus_file` method if you can roll much or all of your data into the single file it requires (third sketch below).
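
A sketch of that first recommendation – survey the vocabulary once, checkpoint it, then train separately – assuming gensim 4.x parameter names and placeholder paths:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

corpus = PathLineSentences('/path/to/corpus_dir')  # placeholder path

# The slow, essentially single-threaded part: one full pass over all
# data to collect word frequencies.
model = Word2Vec(workers=12)
model.build_vocab(corpus)
model.save('w2v_vocab_only.model')  # checkpoint before any training

# Later: reload the surveyed model and train without repeating the scan.
model = Word2Vec.load('w2v_vocab_only.model')
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save('w2v_trained.model')
```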
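
For the second recommendation, the pruning knobs are just constructor parameters; the particular values below are arbitrary examples to tune, not suggestions for your data (again assuming gensim 4.x):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

corpus = PathLineSentences('/path/to/corpus_dir')  # placeholder path

# Higher min_count discards rare words entirely; a smaller sample value
# randomly drops more occurrences of the most-frequent words.
model = Word2Vec(min_count=100, sample=1e-05, workers=12)
model.build_vocab(corpus)
print(len(model.wv))  # surviving vocabulary size after the min_count cutoff
```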
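
And for the `corpus_file` option, one plain way to roll many line-per-sentence files into the single file it needs (paths are placeholders; this assumes each part file already ends with a newline):

```python
import glob
import shutil

from gensim.models import Word2Vec

# Concatenate every per-part text file into one big file.
with open('/path/to/all_data.txt', 'wb') as out_f:
    for path in sorted(glob.glob('/path/to/corpus_dir/*.txt')):
        with open(path, 'rb') as in_f:
            shutil.copyfileobj(in_f, out_f)

# corpus_file mode can then keep all 36 worker threads busy.
model = Word2Vec(corpus_file='/path/to/all_data.txt', workers=36)
```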