When I train my LDA model as follows

import multiprocessing
from gensim import corpora
from gensim.models import LdaMulticore

# `data` is a list of tokenized documents
dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)

I want to get the full topic distribution over all num_topics topics for each and every document. That is, in this particular case, I want every document to have a distribution over all 50 topics, and I want to be able to access each topic's contribution. Strictly following the mathematics of LDA, this is what the model should return. However, gensim only outputs topics whose probability exceeds a certain threshold, as shown here. For example, if I try

lda[corpus[89]]
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]

which shows only the 3 topics that contribute most to document 89. I have tried the solution in the link above; however, it does not work for me. I still get the same output:

# variational parameters (gamma) for every document: shape (num_docs, num_topics)
theta, _ = lda.inference(corpus)
# normalize each row so it sums to 1
theta /= theta.sum(axis=1)[:, None]

produces the same output, i.e. only 2 or 3 topics per document.

My question is: how do I change this threshold so I can access the FULL topic distribution for each document, no matter how insignificant a topic's contribution to it? The reason I want the full distribution is so I can perform a KL-divergence similarity search between documents' distributions.
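
For context, the KL comparison itself is straightforward once every document has a full, aligned vector of num_topics probabilities. A minimal sketch using scipy (the kl_divergence name and the smoothing constant eps are just illustrative, not part of gensim):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def kl_divergence(p, q, eps=1e-12):
    # Smooth with a tiny constant so zero entries don't make the KL divergence
    # blow up, then renormalize both vectors to sum to 1 before comparing.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return entropy(p / p.sum(), q / q.sum())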

Thanks in advance

PyRsquared
  • You are only seeing three topics for document 89 because there are no other topics associated with it; notice that the topic probabilities in your example already sum to 1. – Harshit Mehta Sep 27 '18 at 19:55

2 Answers

It doesn't seem that anyone has replied yet, so I'll try to answer this as best I can given the gensim documentation.

It seems you need to set the parameter minimum_probability to 0.0 when training the model to get the desired result:

lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1, minimum_probability=0.0)

lda[corpus[233]]
>>> [(0, 5.8821799358842424e-07),
 (1, 5.8821799358842424e-07),
 (2, 5.8821799358842424e-07),
 (3, 5.8821799358842424e-07),
 (4, 5.8821799358842424e-07),
 (5, 5.8821799358842424e-07),
 (6, 5.8821799358842424e-07),
 (7, 5.8821799358842424e-07),
 (8, 5.8821799358842424e-07),
 (9, 5.8821799358842424e-07),
 (10, 5.8821799358842424e-07),
 (11, 5.8821799358842424e-07),
 (12, 5.8821799358842424e-07),
 (13, 5.8821799358842424e-07),
 (14, 5.8821799358842424e-07),
 (15, 5.8821799358842424e-07),
 (16, 5.8821799358842424e-07),
 (17, 5.8821799358842424e-07),
 (18, 5.8821799358842424e-07),
 (19, 5.8821799358842424e-07),
 (20, 5.8821799358842424e-07),
 (21, 5.8821799358842424e-07),
 (22, 5.8821799358842424e-07),
 (23, 5.8821799358842424e-07),
 (24, 5.8821799358842424e-07),
 (25, 5.8821799358842424e-07),
 (26, 5.8821799358842424e-07),
 (27, 0.99997117731831464),
 (28, 5.8821799358842424e-07),
 (29, 5.8821799358842424e-07),
 (30, 5.8821799358842424e-07),
 (31, 5.8821799358842424e-07),
 (32, 5.8821799358842424e-07),
 (33, 5.8821799358842424e-07),
 (34, 5.8821799358842424e-07),
 (35, 5.8821799358842424e-07),
 (36, 5.8821799358842424e-07),
 (37, 5.8821799358842424e-07),
 (38, 5.8821799358842424e-07),
 (39, 5.8821799358842424e-07),
 (40, 5.8821799358842424e-07),
 (41, 5.8821799358842424e-07),
 (42, 5.8821799358842424e-07),
 (43, 5.8821799358842424e-07),
 (44, 5.8821799358842424e-07),
 (45, 5.8821799358842424e-07),
 (46, 5.8821799358842424e-07),
 (47, 5.8821799358842424e-07),
 (48, 5.8821799358842424e-07),
 (49, 5.8821799358842424e-07)]
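
With minimum_probability=0.0, every document now yields all 50 (topic, probability) pairs, so the transformed corpus can be stacked into a dense matrix, e.g. for the KL search mentioned in the question. A rough sketch using gensim's corpus2dense utility (doc_topic_matrix is just an illustrative name):

from gensim.matutils import corpus2dense

# lda[corpus] streams the full per-document topic distributions;
# corpus2dense stacks them into a (num_topics, num_docs) array, so transpose it
# to get one row of 50 probabilities per document.
doc_topic_matrix = corpus2dense(lda[corpus], num_terms=num_topics).T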
PyRsquared
  • What is the consequence of setting eval_every to None? Does skipping the perplexity estimation affect model training, i.e. parameter estimation? – PleaseHelp Mar 18 '18 at 17:10
  • @PleaseHelp setting eval_every to 1 is useful when logging.setLevel('DEBUG') is activated, as it will give you information on the convergence of your algorithm. See https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb for more. – Malik Koné Apr 21 '19 at 00:01

In case it helps someone else:

After training your LDA model, if you want to get all of a document's topics without cutting them off at a lower threshold, set minimum_probability to 0 when calling the get_document_topics method.

ldaModel.get_document_topics(bagOfWordOfADocument, minimum_probability=0.0)
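
For example, to get the full distribution for every document (a quick sketch; doc_topics is just an illustrative name, and corpus is the bag-of-words corpus the model was trained on):

# one list of all num_topics (topic_id, probability) pairs per document
doc_topics = [
    ldaModel.get_document_topics(bow, minimum_probability=0.0)
    for bow in corpus
]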
Memin