What is the held-out probability in Mallet LDA? How can we calculate Perplexity by the held-out probability?

Question

I am new to Mallet. I would like to get perplexity scores for 10 to 100 topics in my LDA model, so I ran the held-out probability evaluation. It gives me a value of -8926490.73103205 for topic=100, which seems a little off. Is that the perplexity score? If not, how can we calculate perplexity from the held-out probability output?

For topic=10, the held-out probability is -8968935.68290883.

score 0 · Answer 1 · answered Nov 01 '22 at 12:58

The value you're getting is the log probability of the entire held-out document set. This is the sum of the log probabilities of each word token. Individual word tokens usually have a log prob of around -7, so I'm guessing your held-out set is around 1M tokens. A log prob of -7 is equivalent to roughly a 1 in 1000 chance (e^-7 is about 0.0009). When developing Mallet we usually just focused on log probability directly. You should check the formal definition of perplexity in the work that you want to compare to.

With the log probability of a collection, you can typically divide by the number of tokens to get an average log prob per token. Negating this number and exponentiating gives a positive score representing the "1 in X" that I mentioned above.
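As a minimal sketch of that calculation in Python: the function name `perplexity_from_log_prob` is made up for illustration and is not part of Mallet, and the token count below is an assumed placeholder, not a number from the question. You would need to count the tokens in your own held-out set, after the same preprocessing used for evaluation. The sketch also assumes the reported value is a natural-log probability.

```python
import math


def perplexity_from_log_prob(total_log_prob: float, num_tokens: int) -> float:
    """Turn a total natural-log probability into a per-token "1 in X" score.

    total_log_prob: sum of log probabilities over all held-out tokens.
    num_tokens: number of tokens in the held-out set (must be counted separately).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    avg_log_prob = total_log_prob / num_tokens  # average log prob per token
    return math.exp(-avg_log_prob)              # negate, then exponentiate


# Value reported in the question for topic=100.
held_out_log_prob = -8926490.73103205
# ASSUMPTION: placeholder token count; replace with the real held-out count.
num_tokens = 1_275_000

print(perplexity_from_log_prob(held_out_log_prob, num_tokens))
```

Because the result depends entirely on the token count, scores are only comparable across topic settings when they are computed on the same held-out set.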
