I have a big problem training with LdaMulticore: it takes 2.5 hours to fit only 25 topics, and only one core is active even though I have 16 of them on an Amazon EC2 instance. How can I optimize this?

Something is bottlenecking the process. When I look at the running processes, only one core is active; after some time all cores become active for a couple of seconds, and then it drops back to one core.

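import pickle
from time import time

import numpy as np
from gensim.models import LdaMulticore
from tqdm import tqdm_notebook

numberTopics = 25  # number of topics
model_gensim = LdaMulticore(num_topics=numberTopics,
                        id2word=dictionary,
                        iterations=10,
                        passes=1,
                        chunksize=50,
                        eta='auto',
                        workers=12)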


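perp_gensim = []
times_gensim = []
no_improvement = 0  # consecutive rounds without a new best perplexity
max_it = 5          # stop after this many rounds without improvement
min_perp = np.inf
start = time()
for _ in tqdm_notebook(range(100)):
    # One more training round over the whole corpus
    model_gensim.update(corpus)
    # Perplexity is evaluated on the full corpus after every round
    tmp = np.exp(-1 * model_gensim.log_perplexity(corpus))
    perp_gensim.append(tmp)
    times_gensim.append(time() - start)
    if tmp < min_perp:
        min_perp = tmp
        no_improvement = 0
    else:
        no_improvement += 1
        if no_improvement == max_it:
            break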
model_gensim.save('results/model_genism/model_genism.model')
with open('results/model_genism/perp_gensim.pickle', 'wb') as f:
    pickle.dump(perp_gensim, f)
with open('results/model_genism/time_gensim.pickle', 'wb') as f:
    pickle.dump(times_gensim, f)

# Print the ten highest-weighted words of each topic
for i, topic in enumerate(model_gensim.get_topics().argsort(axis=1)[:, -10:][:, ::-1], 1):
    print('Topic {}: {}'.format(i, ' '.join([vocabulary[word_id] for word_id in topic])))
Asked by curious95
  • Possibly related: https://stackoverflow.com/questions/33929680/gensim-ldamulticore-not-multiprocessing – scipilot Mar 21 '19 at 08:24
  • Is your "corpus iterator is too slow to use LdaMulticore effectively." ? like https://github.com/RaRe-Technologies/gensim/issues/288 – scipilot Mar 21 '19 at 08:28
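
One way to check the hypothesis in the comment above is to time plain iteration over the corpus, with no model involved. The following is a minimal sketch, assuming the corpus is any re-iterable Python object; `measure_throughput` and `toy_corpus` are hypothetical names made up for illustration and are not part of gensim.

```python
from itertools import islice
from time import perf_counter


def measure_throughput(corpus, max_docs=10000):
    """Return (docs_read, docs_per_second) for plain iteration over `corpus`.

    Hypothetical helper: it only iterates and counts, so the result reflects
    the cost of producing documents, not the cost of training.
    """
    start = perf_counter()
    # Read at most `max_docs` documents so the check stays quick on large corpora
    docs_read = sum(1 for _ in islice(iter(corpus), max_docs))
    elapsed = perf_counter() - start
    rate = docs_read / elapsed if elapsed > 0 else float("inf")
    return docs_read, rate


if __name__ == "__main__":
    # Toy bag-of-words corpus standing in for the real one (an assumption)
    toy_corpus = [[(0, 1), (3, 2)], [(1, 1)], [(2, 4), (3, 1)]]
    n, rate = measure_throughput(toy_corpus)
    print(f"read {n} documents at {rate:.0f} docs/s")
```

If the measured rate turns out to be low, the workers may simply be waiting for input. In that case, holding the corpus in memory with `list(corpus)` (if it fits) or serializing it once to disk, for example with gensim's `MmCorpus.serialize`, might help; whether it does depends on how the corpus is built.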

0 Answers