I am using Gensim's Mallet wrapper for topic modeling:

LdaMallet(path_to_mallet_binary, corpus=corpus, num_topics=100, id2word=words, workers=6, random_seed=2)

While the above ran surprisingly fast, the step below, which obtains the topic distribution for each document (n=40,000), is taking a very long time.

# Store topic distribution for all documents
all_topics=[]
for x in tqdm(range(0, len(doc_list))):
    all_topics.append(lda_model[corpus[x]])

It has taken ~18 hours to process 30,000 documents. I am not sure what I am doing incorrectly. Is there a way to get the topic distribution for all documents much faster?

SanMelkote

2 Answers

I was able to speed this up by calling the Java MALLET binary directly through Python's subprocess module. The doc-topics distribution is written to a file that can easily be imported into a dataframe. The gensim wrapper, although straightforward, seems to have issues.
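A minimal sketch of that approach, under some assumptions: the corpus has already been converted to MALLET's binary format (for example with `mallet import-file`), the paths are hypothetical placeholders, and the doc-topics file uses the dense layout of recent MALLET releases (document index, document name, then one proportion per topic). Older releases write sparse topic/proportion pairs, which this parser does not handle.

```python
import subprocess


def build_train_command(mallet_path, input_file, doc_topics_file,
                        num_topics=100, num_threads=6, random_seed=2):
    """Assemble the argument list for `mallet train-topics`."""
    return [
        mallet_path, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-threads", str(num_threads),
        "--random-seed", str(random_seed),
        "--output-doc-topics", doc_topics_file,
    ]


def read_doc_topics(path):
    """Parse a dense doc-topics file into one list of proportions per document.

    Assumed layout per line: doc index <TAB> doc name <TAB> p_0 <TAB> p_1 ...
    Lines starting with '#' (a header in some versions) are skipped.
    """
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            rows.append([float(value) for value in fields[2:] if value])
    return rows


def train_and_read(mallet_path, input_file, doc_topics_file, **options):
    """Run MALLET once for the whole corpus, then load every distribution."""
    command = build_train_command(mallet_path, input_file, doc_topics_file, **options)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return read_doc_topics(doc_topics_file)
```

The list returned by `read_doc_topics` can be passed straight to `pandas.DataFrame(rows)` if a dataframe is preferred.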

SanMelkote

It turns out most of the time was spent loading the LdaMallet model. I was able to generate 50,000 topic distributions in just 4 minutes by passing the whole corpus in a single call instead of one document at a time (before that, it took as long for me as it did for you).

corpus = [dictionary.doc2bow(preprocess(unseen_document)) for unseen_document in unseen_documents]
distributions = mallet_model[corpus]

You could refer to https://github.com/RaRe-Technologies/gensim/issues/3018
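To illustrate why the single call is so much cheaper, here is a small sketch using a hypothetical stand-in class, not the real gensim wrapper. It assumes, as I understand the gensim 3.x `LdaMallet` wrapper, that every indexing operation launches one external inference run regardless of how many documents are passed in.

```python
class CountingModel:
    """Hypothetical stand-in for the wrapper; counts inference launches."""

    def __init__(self):
        self.calls = 0

    def __getitem__(self, bow):
        self.calls += 1  # the real wrapper would start a MALLET process here
        # A corpus is a list of documents; a document is a list of (id, count) tuples.
        is_corpus = bool(bow) and isinstance(bow[0], list)
        docs = bow if is_corpus else [bow]
        result = [[(0, 1.0)] for _ in docs]  # dummy one-topic distribution
        return result if is_corpus else result[0]


def topics_one_by_one(model, corpus):
    """Pattern from the question: one lookup, and one launch, per document."""
    return [model[doc] for doc in corpus]


def topics_in_one_call(model, corpus):
    """Pattern from this answer: a single lookup for the whole corpus."""
    return list(model[corpus])


corpus = [[(0, 2), (3, 1)], [(1, 1)], [(2, 4), (5, 1)]]

slow_model = CountingModel()
slow = topics_one_by_one(slow_model, corpus)

fast_model = CountingModel()
fast = topics_in_one_call(fast_model, corpus)

print(slow_model.calls, fast_model.calls)
```

Both patterns return the same distributions, but the per-document loop pays the fixed startup cost once for every document, which is where the hours go on a 40,000-document corpus.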