I have an awfully large corpus as input to my doc2vec training, around 23 million documents streamed using an iterable function. I was wondering whether it is possible to monitor training progress, for example by seeing which iteration it's currently on, the words-per-second rate, or some similar metric.
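From what I can tell, gensim reports its progress through Python's standard logging module, so I assume enabling something like the following before training is the intended way to see it, but I'm not sure whether that's the full picture:

```python
import logging

# As far as I understand, gensim emits training progress (current epoch,
# words/sec, estimated time remaining) at INFO level via the standard
# logging module, so this should make that output visible on the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
```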
I was also wondering how to speed up doc2vec, other than reducing the size of the corpus. I discovered the workers parameter and I'm currently training with 4 workers; the intuition behind this number was that multiprocessing cannot take advantage of virtual (hyper-threaded) cores. I was wondering whether this holds for the doc2vec workers parameter, or whether I could use 8 workers or potentially even more (I have a quad-core processor, running Ubuntu).
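For context, my setup looks roughly like the sketch below. MyCorpus and corpus.txt are simplified placeholders for my real streaming iterable, and I'm on gensim 4.x, so parameter names may differ on older versions:

```python
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class MyCorpus:
    """Simplified stand-in for my real iterable that streams ~23M documents."""
    def __init__(self, path="corpus.txt"):  # placeholder path
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

cores = multiprocessing.cpu_count()  # reports 8 here (4 physical cores + hyper-threading)

model = Doc2Vec(vector_size=300, window=5, min_count=5,
                workers=4,  # currently 4; unsure whether 8 would help or hurt
                epochs=10)
model.build_vocab(MyCorpus())
model.train(MyCorpus(), total_examples=model.corpus_count, epochs=model.epochs)
```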
I should add that the Unix command `top -H` reports only around 15% CPU usage per Python process with 8 workers, and around 27% CPU usage per process with 4 workers.