
My question concerns the topic assignment in MALLET and the way it impacts the interpretation of the results.

The doc-topics file gives the proportion of each topic in each document. However, the document at the top of the list for topic X (58%) does not contain any of the words that make up topic X according to the topic-keys file. To investigate, I checked the output-state file and found that many words assigned to topic X do not appear in the topic-keys list.

Why doesn't MALLET calculate the proportion of a topic in the doc-topics file solely from the words that appear in the topic-keys file, that is, the words most important for the topic?

1 Answer


The topic-keys output is intended only as a human-readable summary of the model. A topic is actually a probability distribution over the entire vocabulary, although for most words the probability in any given topic is tiny and determined only by a smoothing parameter. The proportions in the doc-topics file reflect every token in the document that was assigned to the topic, not just the top words, so a document can score highly for a topic without containing any of its listed keys. Printing 100-200 top words for each topic can give an even better sense of what the topic represents; the default number of top words is chosen to fit about one topic per terminal line.
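To make the point about token assignments concrete, here is a minimal Python sketch that recomputes per-document topic shares from state-file lines. It assumes each data line is whitespace-separated with the document index first and the topic id last, and that lines starting with `#` are headers; the function name `topic_proportions` and the sample lines are made up for illustration. It counts raw assignments only, so the numbers will differ slightly from MALLET's doc-topics output, which is also smoothed.

```python
from collections import Counter, defaultdict


def topic_proportions(state_lines):
    """Return {doc_id: {topic: share of that document's tokens}}.

    Assumed line layout (hypothetical sample, not verified against a
    specific MALLET version): doc source pos typeindex type topic
    """
    counts = defaultdict(Counter)
    for line in state_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/hyperparameter lines
        fields = line.split()
        doc_id, topic = fields[0], int(fields[-1])
        counts[doc_id][topic] += 1  # one token assigned to this topic
    return {
        doc: {t: n / sum(c.values()) for t, n in c.items()}
        for doc, c in counts.items()
    }


# Invented sample: none of these words need to be among a topic's top keys.
sample = [
    "#doc source pos typeindex type topic",
    "0 a.txt 0 0 whale 3",
    "0 a.txt 1 1 harpoon 3",
    "0 a.txt 2 2 storm 3",
    "0 a.txt 3 3 tea 1",
    "1 b.txt 0 3 tea 1",
]
props = topic_proportions(sample)
print(props["0"][3])  # 0.75
```

In this toy input, document 0 has three of its four tokens assigned to topic 3, so topic 3 dominates the document regardless of whether "whale", "harpoon", or "storm" would appear in that topic's key list. For a real state file, which MALLET writes gzip-compressed, the lines could be read with `gzip.open(path, "rt")`.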

David Mimno