
I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:

First: Why do we need to add alpha and gamma every time we sample with the equation below? What if we removed alpha and gamma from the equation? Would it still be possible to get a result?

LDA sampling formula
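As far as I understand, this is the standard collapsed Gibbs sampling update; written with α for the document-topic prior and γ for the topic-word prior, it is something like

$$
P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{\neg i} + \alpha\right)\,\frac{n_{k,w_i}^{\neg i} + \gamma}{n_{k,\cdot}^{\neg i} + V\gamma},
$$

where $n_{d,k}^{\neg i}$ is the number of words in document $d$ currently assigned to topic $k$, $n_{k,w_i}^{\neg i}$ is the number of times word $w_i$ is assigned to topic $k$ (both excluding position $i$), and $V$ is the vocabulary size.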

Second: In LDA, we randomly assign a topic to every word in the document, and then we try to optimize the topic assignments based on the observed data. Which part of the equation above corresponds to posterior inference?

edited by Jérôme Bau
asked by Mr. Almars

1 Answer


If you look at the inference derivation on the Wikipedia page, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution that is uniquely determined by the corresponding hyperparameter. The main reason for choosing the Dirichlet as the prior distribution (e.g. P(phi|beta)) is to make the math tractable by exploiting the nice form of the conjugate prior (here the Dirichlet and the categorical distribution; the categorical distribution is a special case of the multinomial distribution with n set to one, i.e. a single trial). The Dirichlet also lets us "inject" our belief that the doc-topic and topic-word distributions are each concentrated on a few topics or words (if we set the hyperparameters to small values). If you remove alpha and beta, I am not sure how it would work.
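Not from the original answer, but to make the role of the hyperparameters concrete, here is a minimal Python sketch (with made-up variable names) of a single collapsed Gibbs update. Alpha and beta enter only as smoothing pseudo-counts; if you dropped them, any topic whose counts are currently zero would get probability 0 and could never be sampled again:

```python
import numpy as np

def sample_topic(d, w, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """Resample the topic of one word token (word id `w` in document `d`).

    n_dk[d, k] : tokens in document d currently assigned to topic k
    n_kw[k, w] : tokens with word id w currently assigned to topic k
    n_k[k]     : total tokens currently assigned to topic k
    The token being resampled must already be removed from all three tables.
    """
    # alpha and beta are the Dirichlet pseudo-counts; they keep every topic
    # reachable even when its observed counts are zero.
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    p /= p.sum()
    return rng.choice(len(p), p=p)

# toy usage: 3 topics, vocabulary of 5 words, one (empty-count) document
rng = np.random.default_rng(0)
n_dk, n_kw, n_k = np.zeros((1, 3)), np.zeros((3, 5)), np.zeros(3)
new_topic = sample_topic(0, 2, n_dk, n_kw, n_k, alpha=0.1, beta=0.01, V=5, rng=rng)
```

Here `alpha` and `beta` are scalars (symmetric priors); the Dirichlet never appears explicitly, it survives in the collapsed sampler only as these additive counts.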

The posterior inference is carried out through the joint probability: in Gibbs sampling, at least, you work with the joint probability while picking one dimension at a time to "transition the state", as in the Metropolis-Hastings paradigm. The formula you posted is essentially derived from the joint probability P(W, Z). I would refer you to the book Monte Carlo Statistical Methods (Robert and Casella) to fully understand why this kind of inference works.
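For reference (this is not in the original answer), the collapsed joint probability that the Wikipedia derivation arrives at, after integrating out θ and φ, has roughly the form

$$
P(\mathbf{W}, \mathbf{Z}; \alpha, \beta)
= \prod_{d=1}^{D} \frac{B(\mathbf{n}_{d,\cdot} + \alpha)}{B(\alpha)}
\prod_{k=1}^{K} \frac{B(\mathbf{n}_{k,\cdot} + \beta)}{B(\beta)},
$$

where $B(\cdot)$ is the multivariate Beta function, $\mathbf{n}_{d,\cdot}$ are the topic counts in document $d$, and $\mathbf{n}_{k,\cdot}$ are the word counts for topic $k$. This is the $P(W, Z)$ that the Gibbs sampler conditions on, one $z_i$ at a time.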

answered by Wei Zhong
  • Thank you for your answer. I have seen many implementations of LDA, and the code is really simple: we first randomly assign a topic to every word in every document, then we use Gibbs sampling to infer the posterior. However, I'm still confused about which part of the code uses the Dirichlet distribution. – Mr. Almars Oct 24 '18 at 08:18
  • @Mr.Almars The key reason to use the Dirichlet is, again, that it simplifies the computation, which is exactly why you do not see a Dirichlet term in the final Gibbs sampler. If you trace where the Dirichlet form goes, it is in $p(W \mid Z; \beta)$ and $p(Z; \alpha)$: integrating out $\phi$ and $\theta$ turns the Dirichlet densities into products of Beta functions, and then in the Gibbs updating rule, since $p(z_i = k \mid Z_{\neg i}, W; \alpha, \beta) \propto \dfrac{p(W, Z; \alpha, \beta)}{p(Z_{\neg i}, W_{\neg i}; \alpha, \beta)}$ is a ratio, most of those factors cancel out, and what remains is essentially the Gibbs updating rule (see the sketch after these comments). – Wei Zhong Oct 25 '18 at 20:19
  • There is one article I recommend you read that is better than the Wiki: *Yi Wang, Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details.* I believe that article is good enough to resolve your confusion. – Wei Zhong Oct 25 '18 at 20:22
  • Thank you for your reply. I will have a look at the book you mentioned – Mr. Almars Oct 28 '18 at 09:10
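A toy sketch (hypothetical, continuing the `sample_topic` example above) of the sweep Mr. Almars describes: no Dirichlet sampling appears anywhere in the loop; the priors survive only as the α/β pseudo-counts inside `sample_topic`.

```python
def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """One full Gibbs pass. docs[d] is a list of word ids; z[d][i] is the
    current topic of the i-th token of document d."""
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[d][i]
            # remove the token's current assignment from the count tables
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # resample its topic from the collapsed conditional
            k_new = sample_topic(d, w, n_dk, n_kw, n_k, alpha, beta, V, rng)
            z[d][i] = k_new
            # record the new assignment
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
```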