The original FastText release by Facebook includes a command-line option, `thread` (default 12), which controls the number of worker threads used for parallel training (on a single machine). If you have more CPU cores and haven't yet tried increasing it, try that.
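If you happen to be driving Facebook's fastText from Python via its official `fasttext` bindings rather than the command line, the same knob is exposed as the `thread` argument. A minimal sketch (the file names and values are purely illustrative):

```python
import fasttext

# Unsupervised skipgram training; 'thread' mirrors the CLI option (default 12).
# The input path and thread count are placeholders - adjust to your setup.
model = fasttext.train_unsupervised(
    "corpus.txt",
    model="skipgram",
    thread=16,  # try something close to your CPU core count
)
model.save_model("fasttext_model.bin")
```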
The gensim implementation (as `gensim.models.fasttext.FastText`) includes an initialization parameter, `workers`, which controls the number of worker threads. If you haven't yet tried increasing it, up to the number of cores, that may help. However, due to extra multithreading bottlenecks in its Python implementation, if you have a lot of cores (especially 16+), you may find maximum throughput with fewer workers than cores, often something in the 4-12 range. (You'll have to experiment, watching the achieved rates via logging, to find the optimal value, and even then all cores won't be maxed.)
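As a rough sketch of that experimentation, assuming the corpus is a restartable iterable of token lists (all values here are illustrative, not recommendations):

```python
import logging
from gensim.models import FastText

# INFO logging makes gensim report training progress and effective words/sec,
# which is what you compare across different 'workers' settings.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

# Placeholder corpus: any restartable iterable of lists of string tokens.
corpus = [["hello", "world"], ["another", "tokenized", "sentence"]]

for workers in (4, 8, 12):
    model = FastText(
        sentences=corpus,
        vector_size=100,
        min_count=1,
        epochs=5,
        workers=workers,  # vary this and compare the logged throughput
    )
```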
You'll only get significant multithreading in gensim if your installation is able to make use of its Cython-optimized routines. If you watch the output when you install gensim via `pip` or similar, there should be a clear error if that step fails. And if you enable logging when loading/using gensim classes, there will usually be a warning if the slower non-optimized versions are being used.
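With INFO-level logging enabled (as in the sketch above), any such warning will show up when a model class is first used. In gensim 3.x you could also check the `FAST_VERSION` flag directly; newer releases may no longer expose it, hence the guarded import in this sketch:

```python
try:
    # In gensim 3.x, FAST_VERSION is -1 if the Cython-optimized routines
    # could not be loaded; any other value means they are in use.
    from gensim.models.word2vec import FAST_VERSION
    print("FAST_VERSION:", FAST_VERSION)
except ImportError:
    # Newer gensim versions may not define this flag at all.
    print("FAST_VERSION not exposed in this gensim version")
```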
Finally, in the ways people often use gensim, the bottleneck can be in their corpus iterator or its IO rather than in the parallelism. To minimize this slowdown:
- Check how fast your corpus iterator can loop over all examples on its own, separate from passing it to the gensim class (see the timing sketch after this list).
- Avoid doing any database selects or complicated regex preprocessing/tokenization in the iterator; do that work once, and save the resulting easy-to-read-as-tokens corpus somewhere.
- If the corpus is coming from a network volume, test whether streaming it from a local volume helps. If it's coming from a spinning HD, try an SSD.
- If the corpus can be made to fit in RAM, perhaps on a special-purpose giant-RAM machine, try that.
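For the first point, a minimal timing sketch, assuming the corpus is a pre-tokenized one-sentence-per-line text file read with gensim's `LineSentence` (the path is illustrative):

```python
import time
from gensim.models.word2vec import LineSentence

corpus = LineSentence("corpus_tokens.txt")  # placeholder path

# Iterate exactly as gensim would during training, but do no training,
# to measure how fast the iterator/IO alone can supply text.
start = time.time()
n_sentences = n_tokens = 0
for sentence in corpus:
    n_sentences += 1
    n_tokens += len(sentence)
elapsed = time.time() - start

print(f"{n_sentences} sentences, {n_tokens} tokens in {elapsed:.1f}s "
      f"(~{n_tokens / elapsed:,.0f} tokens/sec)")
```

If that raw iteration rate isn't comfortably higher than the words/sec gensim logs during training, the iterator (not the worker threads) is your bottleneck.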