1

Why am I getting the same set of topics and words in a gensim LDA model? I used the parameters below, and I checked that there are no duplicate documents in my corpus.

import gensim

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=MY_CORPUS,
    id2word=WORD_AND_ID,
    num_topics=4,
    minimum_probability=minimum_probability,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',  # alternatives: 'symmetric', 'asymmetric'
    per_word_topics=True,
)

Results

[
(0, '0.004*lily + 0.01*rose + 0.00*jasmine'),
(1, '0.005*geometry + 0.07*algebra + 0.01*calculation'),
(2, '0.003*painting + 0.001*brush + 0.01*colors'),
(3, '0.005*geometry + 0.07*algebra + 0.01*calculation')
]

Notice: Topic #1 and #3 are identical.

sophros
sharp

2 Answers

1

Each of the topics likely contains a large number of words weighted differently. When a topic is being displayed (e.g. using lda_model.show_topics()) you are going to get only a few words with the largest weights. This does not mean that there are no differences between topics among the remaining vocabulary.

You can steer the number of displayed words to inspect the remaining weights:

 lda_model.show_topics(num_topics=4, num_words=10, log=False, formatted=True)

and increase the num_words parameter to include even more words.
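One way to check whether two topics really coincide is to compare their full word distributions instead of the truncated display. The sketch below uses made-up weights over a six-word vocabulary; in gensim the full topic-word matrix would come from lda_model.get_topics(). The helper names top_words and hellinger are illustrative and not part of gensim.

```python
import math

# Hypothetical vocabulary and topic-word weights (each row sums to 1).
vocab = ["algebra", "calculation", "geometry", "proof", "angle", "matrix"]
topic_1 = [0.40, 0.20, 0.16, 0.14, 0.05, 0.05]
topic_3 = [0.40, 0.20, 0.16, 0.05, 0.05, 0.14]


def top_words(weights, vocab, n):
    """Return the n highest-weighted words, as a truncated display would."""
    ranked = sorted(zip(weights, vocab), key=lambda pair: -pair[0])
    return [word for _, word in ranked[:n]]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


# The top three words match, yet the distance over the full vocabulary is non-zero.
print(top_words(topic_1, vocab, 3) == top_words(topic_3, vocab, 3))
print(hellinger(topic_1, topic_3))
```

If the distance between two topics stays at (or very near) zero over the whole vocabulary, the topics are genuinely duplicated and the parameter changes discussed next are worth trying.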

Now, there is also a possibility that:

  • the number of topics should be different (e.g. 3),
  • or minimum_probability smaller (what is the value you use?),
  • or number of passes larger,
  • chunksize smaller,
  • corpus larger (what is the size?) or stripped of stop words (did you do that?).

I encourage you to experiment with different values of these parameters to check whether any combination works better.
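A minimal sketch of how such an experiment could be organised: enumerate every combination of a few candidate values and train one model per combination. The candidate values and the param_grid helper are assumptions made for illustration; each resulting dict is meant to be unpacked into the LdaModel constructor alongside your own corpus and id2word.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommendations.
search_space = {
    "num_topics": [3, 4],
    "passes": [10, 20],
    "chunksize": [50, 100],
}


def param_grid(space):
    """Yield one dict of keyword arguments per combination of values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(param_grid(search_space))
for params in combinations:
    # Each dict could be passed on as LdaModel(corpus=..., id2word=..., **params).
    print(params)
```

Keeping random_state fixed across runs makes the comparison between combinations easier to interpret.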

sophros
  • Thanks. I set minimum_probability to the default, 0.01, for 1500 documents. Is there a good article/Quora read on how to set minimum_probability? I will try different parameters and see if the problem goes away. – sharp Jan 21 '21 at 13:12
  • I can only recommend my other answers: https://stackoverflow.com/questions/50805556/understanding-parameters-in-gensim-lda-model and https://stackoverflow.com/questions/65014553/how-to-tune-the-parameters-for-gensim-ldamulticore-in-python (BTW, would be great if you could upvote them if you find them useful; same for this answer). – sophros Jan 21 '21 at 13:41
0

You need to change the alpha parameter to 50/i, where i is your number of topics, and set the eta parameter (eta = 0.1).

Like this:

import gensim

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=4,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha=50 / 4,  # 50 / num_topics
    eta=0.1,
    per_word_topics=True,
)
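If you change num_topics, alpha has to change with it. A small sketch that keeps the two in sync; heuristic_priors is a hypothetical helper name, not a gensim function, and the values are just the 50/i rule and eta = 0.1 from above.

```python
def heuristic_priors(num_topics, eta=0.1):
    """Return alpha = 50 / num_topics together with a fixed eta."""
    if num_topics <= 0:
        raise ValueError("num_topics must be positive")
    return {"alpha": 50 / num_topics, "eta": eta}


priors = heuristic_priors(4)
print(priors)
```

The resulting dict can be unpacked into the constructor call, e.g. LdaModel(..., num_topics=4, **priors).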