Gensim LdaMulticore is not multiprocessing properly (using just 4 workers)

Question

I am using Gensim's LdaMulticore to perform LDA. I have around 28M small documents (around 100 characters each).

I set the workers argument to 20, but top shows only 4 processes in use. There are some discussions suggesting that the cause might be slow corpus reading, for example: gensim LdaMulticore not multiprocessing? and https://github.com/piskvorky/gensim/issues/288

But both of those use MmCorpus, whereas my corpus is completely in memory. My machine has a very large amount of RAM (250 GB), and loading the corpus into memory takes around 40 GB. Even so, LdaMulticore uses just 4 processes. I created the corpus as:

corpus = [dictionary.doc2bow(text) for text in texts]

I cannot work out what the limiting factor is here.

I have the same issue. The logger goes silent after the line "using serial LDA version on this node". - Koustuv Sinha, Jan 13 '17 at 15:36
I got a similar issue when running on a Mac with vecLib, and solved it by using OpenBLAS. Not sure if your problem is related to that (the BLAS library and the way multithreading works on your platform). - tnarik, Apr 22 '17 at 10:42

score 1 · Answer 1 · answered Nov 15 '19 at 10:52

I would check what batch size you are using.

I found that when batch size × n_workers is greater than the number of documents, I cannot use all the available workers. This makes sense, since each worker is given a batch of documents per pass, so some workers can be "starved" if the batch size is not chosen with that in mind.

I am not sure this solves your specific problem, but it is indeed the reason many people have reported that the multicore version does not "work" as expected in terms of multiprocessing.
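The arithmetic behind this check can be sketched in a few lines of plain Python. This is only an illustration, not Gensim's actual scheduling logic: the helper names busy_workers and largest_chunksize are made up for this example, and it assumes that "batch size" corresponds to the chunksize argument of LdaMulticore and that each chunk goes to exactly one worker.

```python
import math


def busy_workers(num_docs: int, chunksize: int, workers: int) -> int:
    """Upper bound on how many workers can hold a chunk at the same time.

    Hypothetical helper: assumes the corpus is split into
    ceil(num_docs / chunksize) chunks, one chunk per worker at a time.
    """
    num_chunks = math.ceil(num_docs / chunksize)
    return min(workers, num_chunks)


def largest_chunksize(num_docs: int, workers: int) -> int:
    """Largest chunk size that still yields at least one chunk per worker."""
    return max(1, num_docs // workers)


# Small corpus: 10,000 documents, chunks of 2,000 documents, 20 workers.
print(busy_workers(10_000, 2_000, 20))

# Chunk size that would let all 20 workers receive a chunk.
print(largest_chunksize(10_000, 20))

# The corpus size from the question: about 28 million documents.
print(busy_workers(28_000_000, 2_000, 20))
```

With 10,000 documents and chunks of 2,000, only 5 chunks exist, so at most 5 of the 20 workers can be busy. For the 28M-document corpus in the question the same bound is 20 (2,000 is used here because it is Gensim's default chunksize), so under these assumptions batch size alone would not explain seeing only 4 processes.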
