8

The lda.show_topics() method in the following code only prints the distribution of the top 10 words for each topic. How do I print the full distribution over all the words in the corpus?

from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train on the bag-of-words corpus built above (corpus_tfidf was never defined)
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)

for i in lda.show_topics():
    print(i)
Braiam
alvas
  • You could do the hacky thing, and change the lda package in site-packages (or wherever it is on your computer) to print all of them, or copy their code for it into your program, and change it to print out all instead of 10. – debianplebian Jul 15 '13 at 20:11
  • just found the answer, it's sort of hidden in the api =). See answer below – alvas Jul 15 '13 at 20:17
  • good job finding your own answer. – debianplebian Jul 15 '13 at 20:20

3 Answers

8

show_topics() has a parameter called topn that sets how many of the top words are returned from each topic's word distribution (newer gensim releases appear to name this parameter num_words). See http://radimrehurek.com/gensim/models/ldamodel.html

So instead of the default lda.show_topics(), you can pass len(dictionary) as topn to get the full word distribution for each topic:

for i in lda.show_topics(topn=len(dictionary)):
    print(i)
alvas
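To see why setting the word count to the vocabulary size yields the whole distribution, here is a minimal pure-Python sketch of top-N truncation over a single topic. It does not call gensim: the top_words helper and the toy probabilities are invented for illustration only.

```python
# Illustrative only: a toy topic-word distribution, not output from a real model.
topic = {"system": 0.30, "graph": 0.25, "trees": 0.20, "user": 0.15, "eps": 0.10}


def top_words(word_probs, topn=10):
    """Return the topn (word, probability) pairs, most probable first."""
    ranked = sorted(word_probs.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]


# A small topn truncates the distribution ...
print(top_words(topic, topn=2))

# ... while topn equal to the vocabulary size returns every word,
# so the probabilities add up to (approximately) 1.
full = top_words(topic, topn=len(topic))
print(full)
print(sum(prob for _, prob in full))
```

The same idea is what len(dictionary) achieves in the answer above: the cutoff becomes as large as the vocabulary, so nothing is dropped.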
4

show_topics() takes two parameters, num_topics and num_words: for num_topics topics, it returns the num_words most significant words (10 words per topic by default). See http://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topics

So you can pass len(lda.id2word) as num_words to get the full word distribution for each topic, and lda.num_topics as num_topics to include every topic in your LDA model.

for i in lda.show_topics(formatted=False, num_topics=lda.num_topics, num_words=len(lda.id2word)):
    print(i)
  • Please explain your answer. SO doesn't just exist to answer questions, but to help people learn. Code Only answers are considered low quality – Machavity May 17 '16 at 15:29
0

The code below prints each word along with its probability. It prints the top 10 words per topic; change num_words=10 to print more.

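for words in lda.show_topics(formatted=False, num_words=10):
    print(words[0])  # topic id
    print("******************************")
    # Assumes recent gensim, where each entry is already a (word, probability)
    # pair when the model has id2word set, so no dictionary lookup is needed.
    for word_prob in words[1]:
        print("(", word_prob[0], ",", word_prob[1], ")", end="")
    print("")
    print("******************************")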
Shubham Sharma
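As a rough sketch of a more compact way to print the same information, the snippet below formats one topic per line. It assumes that show_topics(formatted=False) returns a list of (topic_id, [(word, probability), ...]) tuples, as it appears to in recent gensim releases. The format_topic helper is hypothetical and the sample data is hand-written, not produced by a trained model.

```python
# Hand-written sample in the shape assumed for show_topics(formatted=False);
# the numbers are invented, not produced by a trained model.
sample_topics = [
    (0, [("system", 0.12), ("user", 0.09)]),
    (1, [("graph", 0.15), ("trees", 0.11)]),
]


def format_topic(topic_id, word_probs):
    """Render one topic as 'topic N: word (0.123), ...'."""
    terms = ", ".join("%s (%.3f)" % (word, prob) for word, prob in word_probs)
    return "topic %d: %s" % (topic_id, terms)


for topic_id, word_probs in sample_topics:
    print(format_topic(topic_id, word_probs))
```

With a real model, sample_topics could be replaced by the result of lda.show_topics(formatted=False, num_words=len(dictionary)), provided the return shape matches the assumption above.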