
I am currently working with 9600 documents and applying gensim LDA. The training step seems to take forever to produce a model. I've tried the multicore function as well, but it doesn't seem to work. I ran it for almost 3 days straight and still could not get the LDA model. I've checked some features of my data and my code, and I read the question gensim LdaMulticore not multiprocessing?, but I still haven't found a solution.

# serialize the bag-of-words corpus to disk in Matrix Market format,
# then reload it and load the previously saved dictionary
corpora.MmCorpus.serialize('corpus_whole.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus_whole.mm')
dictionary = gensim.corpora.Dictionary.load('dictionary_whole.dict')

dictionary.num_pos
12796870

print(corpus)
MmCorpus(5275227 documents, 44 features, 11446976 non-zero entries)

# LDA model training code
lda = models.LdaModel(corpus, num_topics=45, id2word=dictionary,
                      update_every=5, chunksize=10000, passes=100)

ldamulti = models.LdaMulticore(corpus, num_topics=45, id2word=dictionary,
                               chunksize=10000, passes=100, workers=3)
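
One way to see whether training is progressing at all is gensim's standard logging setup (a minimal sketch; the format string is just the usual gensim example):

import logging

# print gensim's per-chunk progress messages so a stalled run is visible
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)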

This is my config check for BLAS, though I am not sure I installed the proper one. One thing I struggled with here is that I can't use the apt-get command to install packages on my Mac (apt-get is a Linux package manager, so it isn't available on macOS); I've installed Xcode but it still gives me an error.

python -c 'import scipy; scipy.show_config()'
lapack_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/misun/anaconda/lib']
    include_dirs = ['/Users/misun/anaconda/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/misun/anaconda/lib']
    include_dirs = ['/Users/misun/anaconda/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/misun/anaconda/lib']
    include_dirs = ['/Users/misun/anaconda/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/misun/anaconda/lib']
    include_dirs = ['/Users/misun/anaconda/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
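
As a rough, hypothetical check that the MKL build is actually being used, timing a large matrix multiply with NumPy helps; an MKL-backed build should finish this well under a second and use several cores:

import time
import numpy as np

# a 2000x2000 matmul is roughly 1.6e10 floating-point ops; a fast,
# multithreaded BLAS finishes it in a fraction of a second
a = np.random.rand(2000, 2000)
t0 = time.time()
np.dot(a, a)
print('matmul took %.2f seconds' % (time.time() - t0))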

I have a poor understanding of how to use ShardedCorpus in Python with my dictionary and corpus, so any help will be appreciated! I haven't slept for 3 days trying to figure this out!! Thanks!!
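
For reference, a minimal sketch of what ShardedCorpus usage might look like (the 'corpus_shards' prefix and the parameter values are illustrative assumptions, and the exact API depends on the gensim version, so check its docs):

from gensim.corpora.sharded_corpus import ShardedCorpus

# the constructor iterates over `corpus` once and writes fixed-size
# shards to disk under the given output prefix
sharded = ShardedCorpus('corpus_shards', corpus,
                        dim=len(dictionary),  # vocabulary size
                        shardsize=10000,      # documents per shard
                        gensim=True)          # yield the (id, value) tuples gensim models expect

# later runs can reload the shards without re-serializing
sharded = ShardedCorpus.load('corpus_shards')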

  • There are some surprising numbers when you print your MmCorpus. It says more than 5 million documents although you stated only 9600. Also, there are only 44 features (distinct words?) with more than 11 million non-zero entries. Not to mention a dictionary with more than 12 million entries. Are you sure your `gensim` dictionary, corpus, etc. are OK? – WolfgangK Apr 12 '18 at 13:08
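
A quick way to check the numbers that comment asks about (hypothetical snippet, using the names from the question):

# these counts should be consistent with ~9600 source documents
# and a realistic vocabulary size
print(corpus.num_docs)      # documents in the serialized corpus
print(len(dictionary))      # unique tokens kept in the dictionary
print(dictionary.num_docs)  # documents seen while building the dictionary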

1 Answer


I cannot really reproduce your problem on my machine, but it looks to me as if your problem is not multiprocessing but rather your passes parameter, which seems far too high. Try something like 1 or 2 as a starting point; if your topics don't converge well, you can still increase it.

lda = models.LdaModel(corpus, num_topics=45, id2word=dictionary,
                      update_every=5, chunksize=10000, passes=1)

This should be done within a day at most, probably just a few hours (depending on your machine).
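
The same goes for the multicore variant from your question, with the same reduced passes:

ldamulti = models.LdaMulticore(corpus, num_topics=45, id2word=dictionary,
                               chunksize=10000, passes=1, workers=3)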

Jérôme Bau