
I'm relatively new to the world of Latent Dirichlet Allocation. I am able to generate an LDA model following the Wikipedia tutorial, and I can generate an LDA model from my own documents. My next step is to understand how I can use a previously generated model to classify unseen documents. I'm saving my "lda_wiki_model" with:

    id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
    mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
    lda.save('lda_wiki_model.lda')

And I'm loading the same model with:

    new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model

I have a "new_doc.txt", and I turn my document into a id<-> term dictionary and converted this tokenized document to "document-term matrix"

But when I run new_topics = new_lda[corpus] I just get a <gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50>.

How can I extract topics from that?

I already tried

    lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
    corpus_lda = lsa[new_topics]
    print(lsa.print_topics(num_topics=1, num_words=7))

and

    print(corpus_lda.print_topics(num_topics=1, num_words=7))

but that returns topics not related to my new document. Where is my mistake? Am I misunderstanding something?

If I run a new model using the dictionary and corpus created above, I get the correct topics. My point is: how do I re-use my model? Is it correct to re-use that wiki_model this way?

Thank you.

Marco Oliveira

3 Answers


I was facing the same problem. This code solves it:

    new_topics = new_lda[corpus]

    for topic in new_topics:
        print(topic)

This will give you, for each document, a list of tuples of the form (topic number, probability).
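If you also want to see the words behind those topic numbers, a small extension of the loop above could look like this (a sketch that reuses the same new_topics variable; topn=7 is just an example value):

    for doc_topics in new_topics:
        # Pick the topic with the highest probability for this document
        best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
        print(best_topic, best_prob)
        print(new_lda.print_topic(best_topic, topn=7))  # top words of that topic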

Lavanya

From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:

Converting the entire corpus at the time of calling corpus_transformed = model[corpus] would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence.

If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
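For completeness, a sketch of that serialize-then-reuse pattern (the file name is an assumption):

    from gensim import corpora

    corpus_transformed = new_lda[corpus]  # lazy TransformedCorpus, nothing computed yet
    corpora.MmCorpus.serialize('new_docs_lda.mm', corpus_transformed)  # evaluate once, write to disk

    # Later: stream the stored result as many times as needed
    for doc in corpora.MmCorpus('new_docs_lda.mm'):
        print(doc)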

Hope it helps.

simone

This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.

    import pandas as pd

    # Access the unseen corpus
    corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

    # Transform into LDA space based on the old model
    lda_unseen = lda_model[corpus_test]

    # Print results and collect them for export to CSV
    topic_probability = []
    for t in lda_unseen:
        print(t)
        topic_probability.append(t)

    results_test = pd.DataFrame(topic_probability,
                                columns=['Topic 1', 'Topic 2', 'Topic 3',
                                         'Topic 4', 'Topic 5', 'Topic n'])

    results_test.to_csv('test_results.csv', index=True, header=True)

Code inspired by this post.

Anavir