I am working through the LDA (Latent Dirichlet Allocation) topic modelling method to extract topics from a set of documents. From what I have understood from the link below, this is an unsupervised learning approach that categorizes / labels each document with the extracted topics.

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

In the sample code given in that link, a function is defined to print the top words associated with each identified topic.

sklearn.__version__

Out[41]: '0.17'

from sklearn.decomposition import LatentDirichletAllocation 


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

My question is this: is there any component or matrix of the fitted LDA model from which we can get the document-topic association?

For example, I need to find the top 2 topics associated with each document and use them as that document's label / category. Is there a component that gives the distribution of topics within a document, similar to model.components_, which gives the distribution of words within a topic?

Bala

1 Answer

You can compute the document-topic association using the transform(X) method of the LatentDirichletAllocation class.

With the example code, this would be:

doc_topic_distrib = lda.transform(tf)

where lda is the fitted LDA model and tf is the input data (the document-term matrix) you want to transform.
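As a rough end-to-end illustration of the call above, here is a minimal sketch. The toy corpus, the choice of two topics and the fixed random_state are assumptions made only for this example. It uses the n_components parameter name from recent scikit-learn releases (version 0.17, as in the question, called it n_topics). The rows are normalised explicitly, because as far as I know older releases returned unnormalised values from transform.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell as investors sold shares",
    "investors bought shares when the market rose",
]

# Term-frequency (document-term) matrix, named tf as in the question.
tf_vectorizer = CountVectorizer(stop_words="english")
tf = tf_vectorizer.fit_transform(docs)

# n_components is the name in recent releases; 0.17 used n_topics instead.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(tf)

# One row per document, one column per topic.
doc_topic_distrib = lda.transform(tf)

# Normalise each row to sum to 1 (a no-op if the rows are already normalised).
doc_topic_distrib = doc_topic_distrib / doc_topic_distrib.sum(axis=1, keepdims=True)

print(doc_topic_distrib.shape)
```

Each row of doc_topic_distrib can then be read as the topic mixture of the corresponding document, in the same way that each row of lda.components_ describes the word weights of one topic.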

clemgaut
  • Thanks! Worked like a charm – prashanth Feb 19 '17 at 17:06
  • I'm a bit confused on this solution. I was hoping to get an output that shows the topics associated with each document. Something like this Document #1: Topic: [1, 2, 3] – moku Dec 14 '17 at 19:11
  • What you get is the distribution of topics for each document: each row corresponds to a document and each column to a topic. To get the result you want, look at each row and take the column indices of the three largest values. This would give you the three most important topics per document. – clemgaut Jan 10 '18 at 15:48
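The row-wise lookup described in the last comment can be sketched with NumPy alone. The helper name top_topics_per_doc and the small example matrix are hypothetical; in practice the input would be the matrix returned by lda.transform(tf).

```python
import numpy as np


def top_topics_per_doc(doc_topic_distrib, n_top_topics=2):
    """Return, for each document (row), the topic indices (columns)
    with the largest weights, ordered from largest to smallest."""
    doc_topic_distrib = np.asarray(doc_topic_distrib)
    # argsort is ascending, so reverse each row and keep the first n columns.
    return np.argsort(doc_topic_distrib, axis=1)[:, ::-1][:, :n_top_topics]


# Hypothetical 3-document x 4-topic matrix standing in for lda.transform(tf).
example = np.array([
    [0.10, 0.60, 0.05, 0.25],
    [0.70, 0.10, 0.15, 0.05],
    [0.20, 0.05, 0.30, 0.45],
])

for doc_idx, topics in enumerate(top_topics_per_doc(example, n_top_topics=2)):
    print("Document #%d: Topics: %s" % (doc_idx, topics.tolist()))
# Document #0: Topics: [1, 3]
# Document #1: Topics: [0, 2]
# Document #2: Topics: [3, 2]
```

Topic indices here are zero-based and match the "Topic #%d" numbering printed by print_top_words in the question, so the two outputs can be read side by side.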