When training a topic model in Mallet, it is possible to learn hyperparameters during inference via the --optimize-interval [INTEGER] option. I have the following questions regarding this option:

  1. Which parameters are learned? Are alpha and beta learned simultaneously, or only one of them, and if so, which one?

  2. What is the rationale behind the --use-symmetric-alpha option? The help within Mallet says: "...Only optimize the concentration parameter of the prior over documents-topic distribution...". But the prior for the documents-topic distribution is alpha, isn't it? So the command should be named --use-symmetric-beta, following the help. Or is there just a mistake in the help text? Furthermore, as far as I understood the literature (see e.g. Wallach et al. (2009): Rethinking LDA: Why Priors Matter), an asymmetric prior is only advantageous for the documents-topic distribution and brings no benefit for the topic-word distribution. Alpha is the Dirichlet prior for the documents-topic distribution. Following this, I do not understand the purpose of the --use-symmetric-alpha option.

  3. Is there a possibility in Mallet to learn only the prior of the documents-topic distribution?

Thanks for any help.

Thomas

1 Answer

I also struggled with this parameter and the misleading help text. Therefore, I simply tested it, compared different log outputs, and finally searched through the source code and opened a PR, which led me to the following answers to your questions:

Which parameters are learned? Are alpha and beta learned simultaneously, or only one of them, and if so, which one?

It depends. With the default settings, no parameters are learned, because --optimize-interval is 0. This means that alpha (⚠️ which is actually the sum of all alpha priors) stays the same, i.e., alpha_k = 5.0 / num-topics. Beta is 0.01 by default. As a result, both alpha and beta are symmetric Dirichlet priors. (Although the alpha parameter is documented properly, it is still misleading when you are used to gensim, where you specify alpha, not alphaSum.)
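
For illustration, here is a minimal sketch of a training call with these defaults (the input and output file names are hypothetical; the flags are standard train-topics options):

    # No --optimize-interval given, so it defaults to 0: nothing is learned.
    # --alpha defaults to 5.0 and is the SUM over all topics (alphaSum),
    # so with 50 topics each alpha_k = 5.0 / 50 = 0.1; --beta defaults to 0.01.
    bin/mallet train-topics \
        --input corpus.mallet \
        --num-topics 50 \
        --output-topic-keys keys.txt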

If you provide a value greater than 0 for the parameter --optimize-interval (and pay attention to --num-iterations (default: 1000) and --optimize-burn-in (default: 200)), you enable hyperparameter optimization for both alpha and beta. As a result, both alpha and beta are optimized Dirichlet priors learned from the data. However, alpha is learned as an asymmetric prior, while beta always remains a symmetric prior (even though its concentration parameter is optimized).
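
A run with optimization enabled could look as follows (a sketch, again with hypothetical file names):

    # Re-estimate the hyperparameters every 10 iterations after a
    # burn-in of 200 iterations. Alpha becomes an asymmetric prior;
    # beta stays symmetric, but its concentration parameter is learned.
    bin/mallet train-topics \
        --input corpus.mallet \
        --num-topics 50 \
        --num-iterations 1000 \
        --optimize-burn-in 200 \
        --optimize-interval 10 \
        --output-state state.gz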

If you also set --use-symmetric-alpha true, both alpha and beta are still optimized, but you end up with a symmetric alpha prior (starting from the initial value passed via the parameter --alpha) and a symmetric beta prior learned from the data, based on the initial symmetric prior passed via --beta. Wait, what!? Doesn't optimizing alpha imply learning an asymmetric prior? Not necessarily: the concentration parameter of the initially passed symmetric prior can also be optimized to better fit the data, while the prior itself stays symmetric.
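
As a sketch (hypothetical file names again), such a run could be started like this:

    # Keep alpha symmetric, but still optimize both concentration
    # parameters: alpha starts from alphaSum = 5.0, beta from 0.01.
    bin/mallet train-topics \
        --input corpus.mallet \
        --num-topics 50 \
        --optimize-interval 10 \
        --use-symmetric-alpha true \
        --alpha 5.0 \
        --beta 0.01 \
        --output-state state.gz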

What is the rationale behind the --use-symmetric-alpha option?

To be honest, I don't know. I only observed the behavior described above. Maybe for certain datasets an optimized but still symmetric alpha prior might make more sense, although it is not recommended by Wallach et al.

I previously wrongly assumed that an asymmetric prior is learned for beta if --optimize-interval is set. That is not the case, as can be seen here.

The help within Mallet says: "...Only optimize the concentration parameter of the prior over documents-topic distribution...". But the prior for the documents-topic distribution is alpha, isn't it?

You are right. Alpha is the prior for the documents-topic distribution and beta for the topic-words distribution.
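
In the usual LDA notation (written here in the same plain-text style as above), the two priors enter the generative process like this:

    theta_d ~ Dirichlet(alpha)    documents-topic distribution of document d
    phi_k   ~ Dirichlet(beta)     topic-words distribution of topic k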

So the command should be named --use-symmetric-beta, following the help. Or is there just a mistake in the help text?

Neither does the help text contain a mistake, nor should the command name be changed. Without a bit more background knowledge about Dirichlet distributions, it is hard to understand what this option actually does. I recommend the following slides by H. Wallach or this excellent explanation of a related question where the same misunderstanding occurred.

The non-existent option --use-symmetric-beta is not implemented because beta is always symmetric in Mallet!
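
You can check this yourself by looking at the header of a saved state file (a sketch, assuming a run with --output-state state.gz as above; the header lines are written by ParallelTopicModel.printState):

    # The "#alpha :" line lists one value per topic (different values
    # after asymmetric optimization, identical ones with
    # --use-symmetric-alpha); the "#beta :" line is always a single value.
    zcat state.gz | head -n 3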

Furthermore, as far as I understood the literature (see e.g. Wallach et al. (2009): Rethinking LDA: Why Priors Matter), an asymmetric prior is only advantageous for the documents-topic distribution and brings no benefit for the topic-word distribution. Alpha is the Dirichlet prior for the documents-topic distribution. Following this, I do not understand the purpose of the --use-symmetric-alpha option.

I totally agree with you. From my limited understanding, a parameter --use-symmetric-beta would seem to make more sense. However, Wallach et al. state that an asymmetric prior over the topic-word distributions provides no real benefit, which is exactly why Mallet only ever uses symmetric beta priors and why such an option is unnecessary. Nevertheless, the beta Dirichlet prior, or more precisely its concentration parameter, will still be optimized if --optimize-interval is greater than 0. Wallach et al. also answer the question of why you should prefer asymmetric alpha priors over symmetric ones.

Furthermore, the help text explains the consequences of using --use-symmetric-alpha as follows:

This may reduce the number of very small, poorly estimated topics, but may disperse common words over several topics.

Lessons learned: You can’t (always) trust help texts. They might be misleading or assume background knowledge, which can lead to misunderstandings. If you have trouble understanding the docs, search through the code. Source code never lies.

J.Schneider