
I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I have a few questions.

  1. What is the proper value of the coefficient of the KL loss?

    1. In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?

    2. As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?

    3. What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?

    4. The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network; I assume the KL loss increases with the complexity of the model.

  2. I am trying to implement a neural network with the KL loss without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network in a TF construct?

  3. Is batch normalization or group normalization still helpful in Bayesian deep learning?

  • Welcome to StackOverflow! Please post some of the code you have tried so far. – Kevin Nov 14 '19 at 02:19
  • Regarding my first question, I just found a post that has a similar idea of dividing by the number of weights. I post the link here: https://groups.google.com/a/tensorflow.org/d/msg/tfprobability/PjFhdRBF8_Y/9skyJNDjCQAJ – Liang Zhang Nov 14 '19 at 02:33
  • You're asking too many questions on this post. Please, try asking one question per post (next time). – nbro Jan 27 '20 at 01:21

1 Answer

  1. In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?

In the BBB paper, eq. 8, they refer to M as the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL term should be scaled by 1/M, which is what is done by Graves. Another alternative is the one in eq. 9, where they scale it by \pi_i, with the values in the set {\pi_i} summing to one.
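
As a rough sketch of what that looks like (my paraphrase of eq. 8/9 of the BBB paper; \mathcal{D}_i is the i-th mini-batch and \theta the variational parameters):

```latex
% Per-mini-batch objective from Bayes by Backprop.
% With the uniform choice \pi_i = 1/M, summing over the M mini-batches
% recovers the full ELBO; more generally the \pi_i only need to sum to one.
\mathcal{F}_i(\mathcal{D}_i, \theta)
  = \pi_i \, \mathrm{KL}\!\left[ q(\mathbf{w} \mid \theta) \,\|\, P(\mathbf{w}) \right]
  - \mathbb{E}_{q(\mathbf{w} \mid \theta)}\!\left[ \log P(\mathcal{D}_i \mid \mathbf{w}) \right],
\qquad \sum_{i=1}^{M} \pi_i = 1 .
```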

In the TFP example, num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. Scaling the KL term this way goes by a few names, such as Safe Bayes or tempering. Have a look at Sec. 8 of this paper (https://arxiv.org/pdf/2002.08791.pdf) for some more discussion about the use of tempering within Bayesian inference and its suitability.

As I go from 2d input to 3d images volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?

The ELBO loss will always be larger than just your cross-entropy (which defines your likelihood), because the KL term is non-negative. Have a look at how the KL divergence term in the ELBO arises under a full mean-field approach, where each weight/parameter is assumed to be independent.

Since the assumed posterior is factorised (each parameter is assumed to be independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of KL terms, one per parameter. Since the KL is >= 0, every parameter you add to your model adds another non-negative term to your loss. This is likely why the loss is so much larger for your 3D model: it likely has more parameters.
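
Written out, the factorisation argument is just the following (a sketch for a mean-field posterior over parameters w_1, ..., w_D with a matching factorised prior):

```latex
% Mean-field posterior: each parameter is independent, so the KL splits
% into a sum of per-parameter KL terms, each of which is non-negative.
q(\mathbf{w} \mid \theta) = \prod_{j=1}^{D} q_j(w_j \mid \theta_j)
\;\Longrightarrow\;
\mathrm{KL}\!\left[ q(\mathbf{w} \mid \theta) \,\|\, P(\mathbf{w}) \right]
  = \sum_{j=1}^{D} \mathrm{KL}\!\left[ q_j(w_j \mid \theta_j) \,\|\, P_j(w_j) \right],
\qquad \mathrm{KL}\!\left[ q_j \,\|\, P_j \right] \ge 0 .
```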

Another reason this could occur is if you have less data: the number you divide the KL term by (M, or the number of training examples) is smaller, so the KL term is scaled down less and contributes relatively more to the loss.

What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?

I am unsure of any specific guideline; for training, you are interested primarily in the gradients. A large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log-likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term down, but this feels a bit yucky for the Bayesian in me).
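
As a minimal, self-contained sketch of what I mean (TF1-style graph mode; the `nll` and `kl` tensors here are toy stand-ins for your real data term and KL term):

```python
import tensorflow as tf

# Toy stand-ins for the two scalar terms of your ELBO loss, just so the
# snippet runs on its own; swap in your real tensors.
w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
nll = tf.reduce_mean(tf.square(w - 1.0))  # stand-in for the data term
kl = tf.reduce_sum(tf.square(w))          # stand-in for the summed KL term
num_train_examples = 55000.0              # whatever scaling you use

variables = tf.trainable_variables()
nll_grads = tf.gradients(nll, variables)
kl_grads = tf.gradients(kl / num_train_examples, variables)

# Compare the overall size of the two gradient contributions rather than
# the raw loss values.
nll_grad_norm = tf.global_norm([g for g in nll_grads if g is not None])
kl_grad_norm = tf.global_norm([g for g in kl_grads if g is not None])
```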

The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network; I assume the KL loss increases with the complexity of the model.

Yes, as stated before: in general, more parameters == a larger KL term in your ELBO loss (for a mean-field approach as used in Bayes by Backprop).

I am trying to implement a neural network with the KL loss without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network in a TF construct?

I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
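
That said, if you can use `tfp.layers` directly (outside of a `tf.keras.Model`), one possible route is to read the KL either from each layer's `losses` list or from the posterior/prior distributions the layer exposes. This is only a sketch under those assumptions: the attribute names (`kernel_posterior`, `kernel_prior`) are taken from the TFP layer implementation and may differ between versions, so check them against your TFP release.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Placeholder inputs; the shapes here are just for illustration.
images = tf.placeholder(tf.float32, shape=[None, 28 * 28])
labels = tf.placeholder(tf.int32, shape=[None])

# Instantiate the TFP layers directly, with no tf.keras.Model wrapper.
dense1 = tfp.layers.DenseFlipout(128, activation=tf.nn.relu)
dense2 = tfp.layers.DenseFlipout(10)

logits = dense2(dense1(images))
nll = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Option 1: each TFP layer is itself a Keras layer, so its `losses` list
# holds the KL term it registered, even without a surrounding Model.
kl = tf.add_n(dense1.losses + dense2.losses)

# Option 2: compute the KL yourself from the variational posterior and prior
# that each layer exposes (attribute names assumed, see note above).
kl_manual = tf.add_n([
    tfd.kl_divergence(layer.kernel_posterior, layer.kernel_prior)
    for layer in (dense1, dense2)
])

num_train_examples = 55000  # e.g. the MNIST training-set size
elbo_loss = nll + kl / num_train_examples
train_op = tf.train.AdamOptimizer(1e-3).minimize(elbo_loss)
```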

Is batch normalization or group normalization still helpful in Bayesian deep learning?

For variational inference (as done in Bayes by Backprop), batch norm is fine. For sampling methods such as MCMC, batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on the suitability of batch norm with sampling methods for approximate Bayesian inference.

  • “A large loss does not mean a large gradient. Have a look at the gradients contributed” that is a nice point! – Liang Zhang Dec 17 '19 at 06:06
    "In the TFP example, the `num_examples` variable refers to the number of mini-batches (is your M, assuming you mean this example", it is not correct. [`num_examples` is the number of training examples](https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/bayesian_neural_network.py#L264). It is the number of batches, only if each batch contains one training example. In the case of MNIST, it should be 55k (if I recall correctly). This issue has been raised in the TFP issue tracker before. – nbro Jan 27 '20 at 01:29
  • Apologies @nbro, it looks like you are right about `num_samples` being the number of samples in the training set. I will edit my response now. To be consistent with the non-stochastic version of learning and the methods proposed by [Graves] and BBB, the KL term should be scaled by the size of the batch. It can be scaled with larger values, and is a form of tempering or [Safe-Bayes](https://arxiv.org/pdf/1910.09227v1.pdf). There is some discussion about the validity of it; you can refer to [Sec. 8 here](https://arxiv.org/pdf/2002.08791.pdf) for more info. –  May 12 '20 at 05:27
  • @EthanGoan Actually, the BbB paper presents the mini-batch version of the ELBO with the KL divergence scaled by `1/M`, where `M` is the number of batches. So, the KL term should be scaled by the number of batches and not the size of the batch. Why do you think it should be the size of the batch? – nbro May 12 '20 at 09:57
  • You are totally right @nbro, that is exactly what I meant. Sorry about that. That is exactly what I edited my response to say; I don't know why I wrote the size of the batch in that comment. –  May 12 '20 at 10:34