
I created a Gensim LDA Model as shown in this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

lda_model = gensim.models.LdaMulticore(data_df['bow_corpus'], num_topics=10, id2word=dictionary, random_state=100, chunksize=100, passes=10, per_word_topics=True)

And it generates 10 topics with a log_perplexity of:

lda_model.log_perplexity(data_df['bow_corpus'])  # returns -5.325966117835991

But when I run the coherence model on it to calculate coherence score, like so:

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_df['bow_corpus'].tolist(), dictionary=dictionary, coherence='c_v')
with np.errstate(invalid='ignore'):
    lda_score = coherence_model_lda.get_coherence()

My LDA-Score is nan. What am I doing wrong here?

Ramsha Siddiqui

2 Answers


Solved! CoherenceModel's texts argument requires the original texts, not the bag-of-words corpus that was used to train the LDA model. So when I ran this:

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_df['corpus'].tolist(), dictionary=dictionary, coherence='c_v')
with np.errstate(invalid='ignore'):
    lda_score = coherence_model_lda.get_coherence()

I got a coherence score of: 0.462
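For anyone who wants to catch this before waiting on get_coherence(), here is a minimal sketch of a sanity check for whatever you are about to pass as texts. classify_docs is a hypothetical helper, not part of gensim. It assumes bag-of-words documents are lists of (token_id, count) pairs, which is the shape Dictionary.doc2bow produces, and that tokenized documents are lists of strings.

```python
from numbers import Integral


def classify_docs(docs):
    """Guess the shape of a list of documents (hypothetical helper, not gensim API).

    Returns 'tokens' for list of list of str, 'bow' for list of list of
    (token_id, count) pairs, 'raw_strings' for a list of plain strings,
    and 'unknown' otherwise.
    """
    # Empty documents carry no type information, so skip them.
    docs = [doc for doc in docs if len(doc) > 0]
    if not docs:
        return "unknown"
    # Check for plain strings first: iterating a str yields 1-char strings,
    # which would otherwise be mistaken for tokens.
    if all(isinstance(doc, str) for doc in docs):
        return "raw_strings"
    if any(isinstance(doc, str) for doc in docs):
        return "unknown"
    if all(isinstance(tok, str) for doc in docs for tok in doc):
        return "tokens"
    if all(
        isinstance(pair, tuple) and len(pair) == 2 and isinstance(pair[0], Integral)
        for doc in docs
        for pair in doc
    ):
        return "bow"
    return "unknown"


bow_corpus = [[(0, 1), (1, 2)], [(2, 1)]]           # what LDA was trained on
tokenized = [["topic", "model"], ["gensim", "lda"]]  # what texts= should look like

print(classify_docs(bow_corpus))
print(classify_docs(tokenized))
```

If this reports 'bow' for the value you are passing as texts, you are likely making the same mistake I did.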

Hope this helps someone else making the same mistake. Thanks!

Ramsha Siddiqui
  • Was facing the same issue! Thanks for sharing! – dshgna Jan 28 '21 at 20:26
  • Thank you! I'm testing it out now. How long did you wait for the coherence score to tabulate? I waited just under 2 hours and it's still running – HOA Sep 14 '22 at 16:48

The documentation (https://radimrehurek.com/gensim/models/coherencemodel.html) says to provide "Tokenized texts" (list of list of str). These should be your texts split into individual words that appear in the dictionary you pass to CoherenceModel. If you pass full texts that are not tokenized, the words have no matching entries in the lookup dictionary.
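To illustrate the point about dictionary lookups, here is a rough sketch in plain Python. The token2id dict is a stand-in for the token2id mapping on a gensim Dictionary, tokenize is a deliberately naive whitespace tokenizer, and vocab_coverage is a hypothetical helper, so treat all three as assumptions rather than what gensim does internally.

```python
def tokenize(raw_docs):
    """Naive tokenizer: lowercase and split on whitespace (illustration only)."""
    return [doc.lower().split() for doc in raw_docs]


def vocab_coverage(texts, token2id):
    """Fraction of items in `texts` that have an entry in `token2id`."""
    total = sum(len(doc) for doc in texts)
    if total == 0:
        return 0.0
    known = sum(1 for doc in texts for tok in doc if tok in token2id)
    return known / total


raw_docs = ["Topic models find themes", "Gensim trains topic models"]
# Stand-in for dictionary.token2id; the ids are made up for the example.
token2id = {"topic": 0, "models": 1, "gensim": 2}

texts = tokenize(raw_docs)  # list of list of str

# Tokenized texts: words are found in the lookup.
print(vocab_coverage(texts, token2id))

# Untokenized strings: iterating a str yields single characters,
# none of which are in the lookup, so coverage drops to zero.
print(vocab_coverage(raw_docs, token2id))
```

A coverage at or near zero is a hint that the texts and the dictionary do not line up.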

wordsforthewise
  • Upvoted since this is a possible issue, but OP had made a different mistake with the same solution – David May 08 '22 at 21:43