I have a big problem training with LdaMulticore: it takes 2.5 hours to fit only 25 topics, and only one core is active even though I have 16 of them on Amazon EC2.
How can I optimize this?
Something is bottlenecking the process. When I watch the processes, only one core is busy; after some time all cores light up for a couple of seconds, then it drops back to a single core again.
import pickle
from time import time

import numpy as np
from gensim.models import LdaMulticore
from tqdm import tqdm_notebook

numberTopics = 25  # number of topics

model_gensim = LdaMulticore(num_topics=numberTopics,
                            id2word=dictionary,
                            iterations=10,
                            passes=1,
                            chunksize=50,
                            eta='auto',
                            workers=12)
perp_gensim = []
times_gensim = []
i = 0
max_it = 5         # stop after 5 updates without improvement
min_perp = np.inf  # best (lowest) perplexity seen so far
start = time()

for _ in tqdm_notebook(range(100)):
    model_gensim.update(corpus)
    tmp = np.exp(-1 * model_gensim.log_perplexity(corpus))
    perp_gensim.append(tmp)
    times_gensim.append(time() - start)
    if tmp < min_perp:
        min_perp = tmp
        i = 0
    else:
        i = i + 1
        if i == max_it:
            break
model_gensim.save('results/model_genism/model_genism.model')

with open('results/model_genism/perp_gensim.pickle', 'wb') as f:
    pickle.dump(perp_gensim, f)
with open('results/model_genism/time_gensim.pickle', 'wb') as f:
    pickle.dump(times_gensim, f)

for i, topic in enumerate(model_gensim.get_topics().argsort(axis=1)[:, -10:][:, ::-1], 1):
    print('Topic {}: {}'.format(i, ' '.join([vocabulary[id] for id in topic])))