I'm trying to figure out what the weight assigned to each word in a topic represents in Mallet.
I'm assuming it's some form of document occurrence count. However, I'm having a hard time figuring out how that figure is arrived at.
In my model, several words occur in more than one topic, each time with a different weight, so the number is clearly not the word count over the entire corpus. My next guess was that it is the number of occurrences of the word in the set of documents assigned to the topic, but when I tried to verify that manually, the numbers did not match.
As an example: I'm training a model over a corpus of about 12,000 documents (alpha 0.1, beta 0.01, t = 50). After training, my model has the following topic:
t1 = "knoflook (158.0), olie (156.0), ...."
So the word 'knoflook' is assigned a weight of 158. Yet when I manually count the number of documents in my corpus that contain that word and have t1 assigned, I get a completely different number (1855).
It's possible that my manual verification is off, of course, but it would be useful to know, in general, how the word weight in each topic is arrived at.
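To make the mismatch concrete, here is a toy sketch of two ways the number could be counted. It is not Mallet code: the class `TopicWeightSketch`, the `Token` type and the four tiny documents are made up for illustration. One hypothesis that would explain the gap is that the weight counts individual word tokens assigned to the topic rather than documents, but I have not confirmed that this is what Mallet does.

```java
import java.util.List;

public class TopicWeightSketch {

    /** One token in a document: the word and the topic assigned to that single token (hypothetical type). */
    static final class Token {
        final String word;
        final int topic;
        Token(String word, int topic) { this.word = word; this.topic = topic; }
    }

    /** Made-up corpus: each inner list is one document. */
    static List<List<Token>> toyCorpus() {
        return List.of(
            List.of(new Token("knoflook", 1), new Token("olie", 1), new Token("knoflook", 1)),
            List.of(new Token("knoflook", 7), new Token("olie", 1)),
            List.of(new Token("knoflook", 7), new Token("ui", 1)),
            List.of(new Token("olie", 3)));
    }

    /** Token-level count: tokens of `word` that are themselves assigned to `topic`. */
    static int tokenCount(List<List<Token>> corpus, String word, int topic) {
        int count = 0;
        for (List<Token> doc : corpus) {
            for (Token tok : doc) {
                if (tok.word.equals(word) && tok.topic == topic) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Document-level count: documents that contain `word` and have any token assigned to `topic`. */
    static int documentCount(List<List<Token>> corpus, String word, int topic) {
        int count = 0;
        for (List<Token> doc : corpus) {
            boolean hasWord = false;
            boolean hasTopic = false;
            for (Token tok : doc) {
                if (tok.word.equals(word)) hasWord = true;
                if (tok.topic == topic) hasTopic = true;
            }
            if (hasWord && hasTopic) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<Token>> corpus = toyCorpus();
        System.out.println("token-level count:    " + tokenCount(corpus, "knoflook", 1));    // 2
        System.out.println("document-level count: " + documentCount(corpus, "knoflook", 1)); // 3
    }
}
```

In this toy data the document-level count is larger than the token-level count, which is the same direction as the difference I see between 1855 and 158.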
For reference, the topic string above was rendered with the following code:
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
// Get a list of sorted sets of word ID/weight pairs, one set per topic
ArrayList<TreeSet<IDSorter>> topicSortedWords = topicModel.getSortedWords();
for (int t = 0; t < numberOfTopics; t++) {
    Iterator<IDSorter> iterator = topicSortedWords.get(t).iterator();
    StringBuilder sb = new StringBuilder();
    while (iterator.hasNext()) {
        IDSorter idWeightPair = iterator.next();
        final String wordLabel = dataAlphabet.lookupObject(idWeightPair.getID()).toString();
        final double weight = idWeightPair.getWeight();
        sb.append(wordLabel + " (" + weight + "), ");
    }
    // Drop the trailing ", " (skipped when the topic has no words)
    if (sb.length() >= 2) {
        sb.setLength(sb.length() - 2);
    }
    // sb.toString() is now a human-readable representation of the topic
}