
I'm following along with the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here but don't see where I could get this number. Has anyone been able to do this before with scikit-learn?

user139014
  • Although I haven't used LDA from scikit, I understand that the fit_transform method returns a numpy array of shape [n_samples, n_features_new]. n_features_new should be the number of topics you set in the constructor, and the values represent the 'amount' of each topic in each document (i.e. a document belongs to more than one topic). You should take the index of the maximum value in each row of the returned array as the most probable topic of that document. – Stergios Feb 08 '16 at 17:19
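A minimal, self-contained sketch of what this comment describes, using a made-up toy corpus (the documents and topic count here are purely illustrative; note that recent scikit-learn versions name the parameter n_components, while older ones used n_topics):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "bonds yield interest",
]

# Build a document-term matrix.
dtm = CountVectorizer().fit_transform(docs)

# Fit LDA; fit_transform returns an array of shape (n_samples, n_topics).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)
print(doc_topic.shape)  # (4, 2): one topic distribution per document

# The most probable topic for each document is the argmax along axis 1.
print(np.argmax(doc_topic, axis=1))
```

Each row of doc_topic sums to 1, so it really is a per-document probability distribution over topics.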

1 Answer


LDA computes a probability distribution over topics for each document, so you can interpret the topic of a document as the topic with the highest probability for that document.

If dtm is your document-term matrix and lda your fitted LatentDirichletAllocation object, you can explore the topic mixtures with the transform() function and pandas:

import pandas as pd

N_TOPICS = lda.n_components  # the number of topics the model was fit with

docsVStopics = lda.transform(dtm)
docsVStopics = pd.DataFrame(docsVStopics,
                            columns=["Topic" + str(i + 1) for i in range(N_TOPICS)])
print("Created a (%dx%d) document-topic matrix." % (docsVStopics.shape[0], docsVStopics.shape[1]))
docsVStopics.head()

You can easily find the most likely topic for each document:

most_likely_topics = docsVStopics.idxmax(axis=1)

then get the counts:

most_likely_topics.groupby(most_likely_topics).count()
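The idxmax-then-count steps above can be sketched end to end with a small hypothetical document-topic matrix (the probabilities below are invented for illustration; value_counts() is an equivalent shorthand for the groupby):

```python
import pandas as pd

# Hypothetical topic probabilities for 5 documents and 3 topics.
docsVStopics = pd.DataFrame(
    [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4],
     [0.6, 0.3, 0.1],
     [0.2, 0.2, 0.6]],
    columns=["Topic1", "Topic2", "Topic3"],
)

# Most likely topic per document: column name of the row-wise maximum.
most_likely_topics = docsVStopics.idxmax(axis=1)

# Number of documents assigned to each topic.
counts = most_likely_topics.groupby(most_likely_topics).count()
print(counts)

# Equivalent one-liner:
print(most_likely_topics.value_counts())
```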
Patrizio G