I am trying to understand how the LDA topic model is implemented in the MALLET API. In the ParallelTopicModel class there is a 2D int array called typeTopicCounts, which is initialized in the buildInitialTypeTopicCounts() method through some bitwise operations and later used for each document. What do the values in this array signify? The only information I can get from the source code is that it is indexed by [feature index, topic index].

Sumanta

1 Answer

The computational performance of Gibbs sampling for LDA is dominated by calculating the sampling distribution over topics for each word token. Topic models are set up to have lots of sparsity in the relationship between words and topics: most word types occur in only a few topics. We can get big speedups by reusing as much computation as possible from one word token to the next and by doing only meaningful work (for example, never multiplying by zero).
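To see where the zeros come from, it may help to write out the standard collapsed Gibbs conditional for a token of word type $w$ in document $d$. The notation here is my own, not taken from the MALLET source: $n_{t|d}$ is the number of tokens in $d$ assigned to topic $t$, $n_{w|t}$ is the number of tokens of type $w$ assigned to $t$, $n_t$ is the total number of tokens assigned to $t$, $V$ is the vocabulary size, and all counts exclude the token being resampled.

$$
p(z = t \mid w, d, \text{rest}) \;\propto\; \frac{(n_{t|d} + \alpha_t)\,(n_{w|t} + \beta)}{n_t + V\beta}
$$

For a given word type, $n_{w|t}$ is zero for most topics, so a representation that lists only the topics with nonzero counts lets the sampler skip the rest.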

Each word type has one array of ints in typeTopicCounts. Each int value in that array encodes both a topic and a token count, packed together using bit shift operators. The count is in the high bits, so we can sort topics by count without "unpacking" the integers.
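As a rough illustration of that packing, here is a minimal sketch in Java. It is not copied from MALLET: the class name PackedTopicCounts and the helpers topicBits, pack, topicOf and countOf are hypothetical, and I am assuming the topic index lives in the lowest bits, with just enough bits reserved to hold any topic index.

```java
import java.util.Arrays;

// Hypothetical sketch of packing (count, topic) into a single int.
// Names and layout are assumptions for illustration, not MALLET's code.
class PackedTopicCounts {

    // Number of low bits needed to hold any topic index in [0, numTopics).
    static int topicBits(int numTopics) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(numTopics - 1, 1));
    }

    // Count goes in the high bits, topic index in the low bits.
    static int pack(int count, int topic, int bits) {
        return (count << bits) | topic;
    }

    // Mask off the high bits to recover the topic index.
    static int topicOf(int packed, int bits) {
        return packed & ((1 << bits) - 1);
    }

    // Shift the topic bits away to recover the count.
    static int countOf(int packed, int bits) {
        return packed >>> bits;
    }

    public static void main(String[] args) {
        int bits = topicBits(20); // assume a model with 20 topics

        // One word type that occurs in three topics.
        int[] row = {
            pack(7, 3, bits),   // 7 tokens in topic 3
            pack(2, 17, bits),  // 2 tokens in topic 17
            pack(12, 9, bits)   // 12 tokens in topic 9
        };

        // Sorting the raw ints orders entries by count, because the count
        // occupies the most significant bits.
        Arrays.sort(row);

        // Walk from largest to smallest and unpack only for display.
        for (int i = row.length - 1; i >= 0; i--) {
            System.out.println("topic " + topicOf(row[i], bits)
                    + " count " + countOf(row[i], bits));
        }
    }
}
```

Because the comparison happens on the packed value, the sort never has to separate topic from count; unpacking is needed only when the sampler actually reads an entry.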

Slides from a tutorial for this method are available here:

https://mimno.infosci.cornell.edu/slides/fast-sparse-sampling.pdf

David Mimno