
I compared the contents of the BOW corpus with LDA[BOW corpus] (the same corpus transformed by an LDA model trained on it with, say, 35 topics) and found the following output:

DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]  
LDA 1 : [(29, 0.80571428571428572)]  
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]  
LDA 2 : [(29, 0.83809523809523812)]  
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]  
LDA 3 : [(34, 0.75714285714285712)]  
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]  
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]  
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]  
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]  
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]  
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]  
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]  
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]  
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]  
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]  

Here, DOC N is document N from the BOW corpus, and LDA N is the transformation of DOC N by that LDA model.

Am I correct in understanding that the output for each transformed document "LDA N" lists the topics that document N belongs to? By that understanding, some documents (4, 5, 6, 7 and 8) belong to more than one topic; for example, DOC 8 belongs to topics 12, 19, 22 and 28 with the respective probabilities.

Could you please explain the output of LDA N and correct my understanding where needed, especially since another thread (HERE), by the creator of Gensim himself, mentions that a document belongs to ONE topic?

Ravi Karan

1 Answer


Your understanding of gensim's LDA output is correct. What you need to remember, though, is that LDA[corpus] only outputs topics whose probability exceeds a certain threshold (set when you initialise the model).

The "document belongs to ONE topic" issue is one you need to decide on your own. LDA gives you a distribution over the topics for each document you feed into it*. You then need to decide whether a document having (for instance) 50% of a topic is enough for that document to belong to said topic.
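To make that decision concrete, here is a minimal sketch in plain Python that works directly on the list-of-tuples output shown in the question (the values are DOC 8's). The helper names `dominant_topic` and `topics_above` and the cutoffs 0.15 and 0.5 are made up for illustration; they are not part of gensim.

```python
# Sparse topic output for DOC 8, copied from the question.
doc_topics = [(12, 0.16776844589094803), (19, 0.13980868559963203),
              (22, 0.1728575716782704), (28, 0.37194624921210206)]


def dominant_topic(topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topics, key=lambda pair: pair[1])


def topics_above(topics, cutoff):
    """Return the ids of all topics whose probability is at least `cutoff`."""
    return [topic_id for topic_id, prob in topics if prob >= cutoff]


# "One topic per document": take the most probable topic.
print(dominant_topic(doc_topics)[0])   # 28

# "Several topics per document": keep every topic above a cutoff you choose.
print(topics_above(doc_topics, 0.15))  # [12, 22, 28]
print(topics_above(doc_topics, 0.5))   # []
```

With a 50% cutoff DOC 8 belongs to no topic at all, which is why the cutoff (or the choice to always take the top topic) has to come from what you need the assignment for.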

(*) Again, keep in mind that LDA[corpus] only shows you the topics that exceed the threshold, not the whole distribution. You can access the whole distribution using

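# gamma: one row per document, one column per topic (unnormalised)
theta, _ = lda.inference(corpus)
# normalise each row so it sums to 1, giving the full topic distribution
theta /= theta.sum(axis=1)[:, None]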
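
What those two lines do, and why LDA[corpus] shows fewer topics than the model has, can be sketched without gensim. The snippet below is only an illustration in plain Python: the four `gamma_row` values are invented, `normalize` and `sparse_view` are hypothetical helpers, and the `>=` filter is a simplified stand-in for gensim's thresholding (which, as far as I know, is controlled by the `minimum_probability` argument).

```python
def normalize(gamma_row):
    """Scale one row of unnormalised topic weights so that it sums to 1."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]


def sparse_view(theta_row, min_prob):
    """Keep only (topic_id, probability) pairs with probability >= min_prob."""
    return [(k, p) for k, p in enumerate(theta_row) if p >= min_prob]


# Invented weights for one document over 4 topics (not real model output).
gamma_row = [0.5, 6.5, 0.5, 2.5]

theta_row = normalize(gamma_row)
print(theta_row)                    # [0.05, 0.65, 0.05, 0.25]

# With a cutoff, low-probability topics drop out of the printed result.
print(sparse_view(theta_row, 0.1))  # [(1, 0.65), (3, 0.25)]

# With a cutoff of 0, every topic is listed.
print(len(sparse_view(theta_row, 0.0)))  # 4
```

The full row always sums to 1; the sparse view is the same distribution with the small entries hidden, so the probabilities printed for a document need not add up to 1.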
Matti Lyra
  • Thanks for the explanation @Matti - clears things up a lot. Unfortunately, gensim documentation is sparser than a tfidf matrix, making it difficult to find how I "make a decision" on the number of topics each document has. From [here](https://radimrehurek.com/gensim/models/ldamodel.html) it looks like `gamma_threshold` may be related, and I only see one unrelated instance of `theta` which you have mentioned. How do I make this initial decision when training the model so that when calling `lda[corpus[i]]` I get the full distribution of the `n` topics for `doc[i]` I trained the model on? – PyRsquared Jul 17 '17 at 15:45
  • if you want the full distribution I recommend you use the code above, as that is what `LdaModel` does internally, but then adds a for loop to generate the list output format - which is a waste of time if you wanted to have the full dist anyway – Matti Lyra Jul 21 '17 at 07:46