
I trained an LDA model with MALLET on parts of the Stack Overflow data dump, using a 70/30 split into training and test data.

However, the perplexity values are strange: they are lower for the test set than for the training set. How is this possible? I thought the model should fit the training data better.

I have already double-checked my perplexity calculations but cannot find an error. Do you have any idea what the reason could be?
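For reference, perplexity for a topic model is usually computed from the per-token log-likelihood, so a higher (less negative) LL/token corresponds to a lower perplexity:

$$\text{perplexity} = \exp\!\left(-\frac{\sum_d \log p(\mathbf{w}_d)}{\sum_d N_d}\right),$$

where $N_d$ is the number of tokens in document $d$.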

Thank you in advance!

[Screenshot: LL/token and perplexity values for the training and test sets]

Edit:

Instead of taking the LL/token values for the training set from the console output, I ran the evaluator on the training set as well. Now the values seem plausible.

[Screenshot: LL/token values recomputed with the evaluator]

phly
1 Answer


That makes sense. The LL/token number gives you the joint probability of the topic assignments and the observed words, whereas the held-out probability gives you the marginal probability of just the observed words, with the topic assignments summed out.
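A minimal sketch with MALLET's Java API may make the distinction concrete. The file names, topic count, hyperparameters, and iteration/particle counts below are placeholders, not a recommended setup: `modelLogLikelihood()` is the joint quantity behind the console's LL/token lines, while `MarginalProbEstimator` estimates the marginal (held-out style) word likelihood.

```java
import java.io.File;
import cc.mallet.topics.MarginalProbEstimator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.FeatureSequence;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class JointVsMarginalLL {
    public static void main(String[] args) throws Exception {
        // Placeholder file names for serialized MALLET instance lists.
        InstanceList training = InstanceList.load(new File("train.mallet"));
        InstanceList test = InstanceList.load(new File("test.mallet"));

        // Placeholder settings: 100 topics, alphaSum = 50, beta = 0.01.
        ParallelTopicModel model = new ParallelTopicModel(100, 50.0, 0.01);
        model.addInstances(training);
        model.setNumIterations(1000);
        model.estimate();

        // (a) Joint log-likelihood of words AND sampled topic assignments
        //     on the training data -- the number behind the console LL/token.
        double jointLL = model.modelLogLikelihood();
        System.out.println("joint LL/token (train):    " + jointLL / countTokens(training));

        // (b) Marginal log-likelihood of the words alone, topics summed out,
        //     from the left-to-right evaluator; comparable across train/test.
        MarginalProbEstimator evaluator = model.getProbEstimator();
        double trainMarginalLL = evaluator.evaluateLeftToRight(training, 10, false, null);
        double testMarginalLL  = evaluator.evaluateLeftToRight(test, 10, false, null);
        System.out.println("marginal LL/token (train): " + trainMarginalLL / countTokens(training));
        System.out.println("marginal LL/token (test):  " + testMarginalLL / countTokens(test));
        System.out.println("test perplexity:           "
                + Math.exp(-testMarginalLL / countTokens(test)));
    }

    // Total token count of an instance list (documents stored as FeatureSequences).
    static long countTokens(InstanceList instances) {
        long tokens = 0;
        for (Instance instance : instances) {
            tokens += ((FeatureSequence) instance.getData()).getLength();
        }
        return tokens;
    }
}
```

Dividing each value by the token count of the corresponding set gives per-token numbers, and `exp(-LL/token)` gives perplexity, so training and test figures are only comparable when both come from the evaluator.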

David Mimno