
I am using Gensim's LdaMulticore to perform LDA. I have around 28M small documents (around 100 characters each).

I set the workers argument to 20, but top shows only 4 processes in use. There are some discussions suggesting that the bottleneck might be slow reading of the corpus, for example: "gensim LdaMulticore not multiprocessing?" and https://github.com/piskvorky/gensim/issues/288

But both of those use MmCorpus, whereas my corpus is held entirely in memory. My machine has a large amount of RAM (250 GB), and loading the corpus into memory takes around 40 GB. Even so, LdaMulticore uses only 4 processes. I created the corpus as:

corpus = [dictionary.doc2bow(text) for text in texts]

I cannot work out what the limiting factor is here. What could it be?

Naman
  • I have the same issue. The logger goes silent after the line "using serial LDA version on this node" – Koustuv Sinha Jan 13 '17 at 15:36
  • I get a similar issue when running on a Mac with vecLib. I solved that using OpenBLAS. Not sure if it could be related to that (the BLAS library and the way multithreading works on your platform). – tnarik Apr 22 '17 at 10:42

1 Answer

I would check what batch size you are using.

I found that when batch size × n_workers is greater than the number of documents, I cannot utilize all the available workers. This makes sense, since each worker is given a batch of documents per pass. You might "starve" some of the workers if the batch size is not taken into account.

I am not sure this solves your specific problem, but it is indeed the reason many people have reported that the multicore version does not "work" as expected in terms of multiprocessing.
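To make the arithmetic concrete, here is a rough sketch. It assumes that the "batch size" above corresponds to gensim's chunksize parameter (default 2000) and that each chunk goes to one worker at a time; busy_workers is a hypothetical helper for illustration, not part of gensim, and the real dispatch logic is more involved.

```python
import math

def busy_workers(num_docs, chunksize, workers):
    """Upper bound on how many workers can hold a chunk at the same time.

    Simplified model (an assumption, not gensim's actual scheduler):
    the corpus is cut into ceil(num_docs / chunksize) chunks and each
    chunk is handed to exactly one worker.
    """
    num_chunks = math.ceil(num_docs / chunksize)
    return min(workers, num_chunks)

# Small corpus: only 5 chunks exist, so at most 5 of 20 workers get work.
print(busy_workers(10_000, 2_000, 20))

# Corpus the size of the one in the question: 14,000 chunks, so this
# model alone does not explain seeing only 4 busy processes.
print(busy_workers(28_000_000, 2_000, 20))
```

Under this model, starvation only appears when the corpus is small relative to chunksize × workers, which is consistent with the caveat that it may not be the cause in the question's case.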

Neal Tsur