I am trying to understand how the LDA topic model is implemented in the MALLET API. In the ParallelTopicModel class I can see a 2D int array called typeTopicCounts, which is initialized in the buildInitialTypeTopicCounts() method through some bitwise operations and later used for each document. My question is: what do the values in this array signify? The only information I can get from the source code is that it is indexed by [feature index, topic index].

1 Answer
The computational performance of Gibbs sampling for LDA is dominated by calculating the sampling distribution over topics for each word token. Topic models are set up to have a lot of sparsity in the relationship between words and topics. If we reuse as much computation as possible from one word token to the next and do only the meaningful computations (for example, not multiplying by zero), we can get big speedups.
Each word type has one array of ints in the typeTopicCounts array. Each int value in this array encodes both a topic and a token count, packed together using bit-shift operators. The count is in the high bits, so we can sort topics by count without "unpacking" the integers.
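To make the encoding concrete, here is a minimal sketch of packing a count and a topic index into one int. This is not MALLET's code: the class name PackedTopicCounts and all of its methods are hypothetical, and the sketch assumes that the topic sits in the low bits (just enough bits to hold the largest topic index) and the count in the bits above it, as described above.

```java
import java.util.Arrays;

// Illustrative sketch only; the names here are hypothetical, not MALLET's.
class PackedTopicCounts {

    // Number of low bits needed to hold any topic index in [0, numTopics).
    static int topicBits(int numTopics) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(numTopics - 1, 1));
    }

    // Count goes in the high bits, topic in the low bits.
    // Assumes count is small enough not to overflow the remaining bits.
    static int pack(int count, int topic, int bits) {
        return (count << bits) | topic;
    }

    static int count(int packed, int bits) {
        return packed >> bits;
    }

    static int topic(int packed, int bits) {
        return packed & ((1 << bits) - 1);
    }

    // Sorting the packed ints directly orders topics by count, because the
    // count occupies the high bits. No unpacking is needed for the sort.
    static int[] sortByCountDescending(int[] packed) {
        int[] sorted = packed.clone();
        Arrays.sort(sorted);
        for (int i = 0, j = sorted.length - 1; i < j; i++, j--) {
            int tmp = sorted[i];
            sorted[i] = sorted[j];
            sorted[j] = tmp;
        }
        return sorted;
    }

    public static void main(String[] args) {
        int bits = topicBits(20); // 20 topics need 5 bits
        int[] row = {
            pack(3, 7, bits),   // topic 7 seen 3 times for this word type
            pack(12, 2, bits),  // topic 2 seen 12 times
            pack(1, 19, bits)   // topic 19 seen once
        };
        for (int value : sortByCountDescending(row)) {
            System.out.println("topic " + topic(value, bits)
                    + " count " + count(value, bits));
        }
    }
}
```

Running main lists topic 2 first, then topic 7, then topic 19: sorting the raw ints puts the most frequent topics for a word type at the front, so a sampler can walk the row from the start and stop once the counts run out.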
Slides from a tutorial for this method are available here:
https://mimno.infosci.cornell.edu/slides/fast-sparse-sampling.pdf
