15

I have two tensors, prob_a and prob_b with shape [None, 1000], and I want to compute the KL divergence from prob_a to prob_b. Is there a built-in function for this in TensorFlow? I tried using tf.contrib.distributions.kl(prob_a, prob_b), but it gives:

NotImplementedError: No KL(dist_a || dist_b) registered for dist_a type Tensor and dist_b type Tensor

If there is no built-in function, what would be a good workaround?

nbro
Transcendental

7 Answers

12

Assuming that your input tensors prob_a and prob_b are probability tensors that sum to 1 along the last axis, you could do it like this:

def kl(x, y):
    X = tf.distributions.Categorical(probs=x)
    Y = tf.distributions.Categorical(probs=y)
    return tf.distributions.kl_divergence(X, Y)

result = kl(prob_a, prob_b)

A simple example:

import numpy as np
import tensorflow as tf
a = np.array([[0.25, 0.1, 0.65], [0.8, 0.15, 0.05]])
b = np.array([[0.7, 0.2, 0.1], [0.15, 0.8, 0.05]])
sess = tf.Session()
print(kl(a, b).eval(session=sess))  # [0.88995184 1.08808468]

You would get the same result with

np.sum(a * np.log(a / b), axis=1) 

However, this implementation is a bit buggy (checked in TensorFlow 1.8.0).

If you have zero probabilities in a, e.g. if you try [0.8, 0.2, 0.0] instead of [0.8, 0.15, 0.05], you will get nan, even though by the Kullback-Leibler definition 0 * log(0 / b) should contribute zero.

To mitigate this, one should add some small numerical constant to the probabilities. It is also prudent to use tf.distributions.kl_divergence(X, Y, allow_nan_stats=False) to cause a runtime error in such situations.

Also, if there are some zeros in b, you will get inf values which won't be caught by the allow_nan_stats=False option, so those have to be handled as well.
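
For example, a minimal sketch of that mitigation (the helper name kl_safe and the eps value are my own choices for illustration, not part of the original answer):

def kl_safe(x, y, eps=1e-8):
    # Clip both distributions away from zero so log(x / y) stays finite,
    # then renormalize so each row still sums to 1.
    x = tf.clip_by_value(x, eps, 1.0)
    y = tf.clip_by_value(y, eps, 1.0)
    x = x / tf.reduce_sum(x, axis=-1, keepdims=True)
    y = y / tf.reduce_sum(y, axis=-1, keepdims=True)
    X = tf.distributions.Categorical(probs=x)
    Y = tf.distributions.Categorical(probs=y)
    return tf.distributions.kl_divergence(X, Y, allow_nan_stats=False)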

meferne
  • Your arrays `a` and `b` seem to sum to 1 on the last axis, not on the first – Luca Di Liello Aug 06 '19 at 19:43
  • Yes, it would have been better to say "along axis 1", or even better, the last axis. I meant axis 1 when I wrote "along the first axis", as there is also axis 0. I'll edit the answer. Thanks! – meferne Aug 07 '19 at 12:16
  • `AttributeError: module 'tensorflow' has no attribute 'distributions'` – jtlz2 Apr 08 '20 at 14:46
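
As the last comment notes, tf.distributions no longer exists in TensorFlow 2.x; roughly the same approach works through TensorFlow Probability instead (a sketch, assuming the tensorflow_probability package is installed):

import tensorflow as tf
import tensorflow_probability as tfp

def kl(x, y):
    # Same idea as above, but with the distributions that moved to TFP.
    X = tfp.distributions.Categorical(probs=x)
    Y = tfp.distributions.Categorical(probs=y)
    return tfp.distributions.kl_divergence(X, Y)
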
7

Since softmax_cross_entropy_with_logits exists, there is no need to optimize the KL divergence directly. When prob_a is constant, it differs from the cross entropy only by a constant:

KL(prob_a, prob_b)  
  = Sum(prob_a * log(prob_a/prob_b))  
  = Sum(prob_a * log(prob_a) - prob_a * log(prob_b))  
  = - Sum(prob_a * log(prob_b)) + Sum(prob_a * log(prob_a)) 
  = - Sum(prob_a * log(prob_b)) + const 
  = H(prob_a, prob_b) + const 

If prob_a is not constant, you can rewrite it as the difference of two entropy terms:

KL(prob_a, prob_b)  
  = Sum(prob_a * log(prob_a/prob_b))  
  = Sum(prob_a * log(prob_a) - prob_a * log(prob_b))  
  = - Sum(prob_a * log(prob_b)) + Sum(prob_a * log(prob_a)) 
  = H(prob_a, prob_b) - H(prob_a, prob_a)  
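
A rough sketch of that last identity with plain TensorFlow ops (the helper name kl_via_entropies is made up for illustration; zeros in either tensor still need the handling described in the answer above):

def kl_via_entropies(prob_a, prob_b):
    # KL(a || b) = H(a, b) - H(a, a), computed row-wise
    cross_entropy = -tf.reduce_sum(prob_a * tf.log(prob_b), axis=-1)  # H(a, b)
    entropy = -tf.reduce_sum(prob_a * tf.log(prob_a), axis=-1)        # H(a, a)
    return cross_entropy - entropy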
Jiecheng Zhao
5

I'm not sure why it's not implemented, but perhaps there is a workaround. The KL divergence is defined as:

KL(prob_a, prob_b) = Sum(prob_a * log(prob_a/prob_b))

The cross entropy H, on the other hand, is defined as:

H(prob_a, prob_b) = -Sum(prob_a * log(prob_b))

So, if you create a variable y = prob_a/prob_b, you could obtain the KL divergence by computing -H(prob_a, y). In TensorFlow notation, something like:

KL = tf.reduce_mean(-tf.nn.softmax_cross_entropy_with_logits(prob_a, y))

Transcendental
E.J. White
  • KL divergence must be 0 when `prob_a` = `prob_b`. But last line doesn't give 0. – Transcendental Jan 26 '17 at 00:33
  • Yes, it does. When `prob_a = prob_b`, we get `y = 1`. Then, `H(prob_a, y)` is zero from `log(y)`. Are you saying you checked it using Tensorflow's `softmax_cross_entropy_with_logits(prob_a, y)` and the result was not zero? – E.J. White Jan 27 '17 at 07:26
  • 1
    Exactly. TensorFlow's implementation might be slightly different than the actual formula. – Transcendental Jan 27 '17 at 11:17
  • 2
    Worth pointing out that softmax_cross_entropy_with_logits(prob_a,y) does not actually implement H(prob_a,y), it implements H(softmax(a),y). So using softmax_cross_entropy_with_logits will only work if you try to calculate the KL divergence on the activations of a softmax function (prob_a) and have access to the unscaled logits (a) – shapecatcher Aug 20 '19 at 13:17
2

tf.contrib.distributions.kl takes instances of a tf.distributions.Distribution, not a Tensor.

Example:

  ds = tf.contrib.distributions
  p = ds.Normal(loc=0., scale=1.)
  q = ds.Normal(loc=1., scale=2.)
  kl = ds.kl_divergence(p, q)
  # ==> 0.44314718
jvdillon
1

Assuming that you have access to logits a and b:

prob_a = tf.nn.softmax(a)
cr_aa = tf.nn.softmax_cross_entropy_with_logits(labels=prob_a, logits=a)  # H(prob_a, prob_a)
cr_ab = tf.nn.softmax_cross_entropy_with_logits(labels=prob_a, logits=b)  # H(prob_a, prob_b)
kl_ab = tf.reduce_sum(cr_ab - cr_aa)  # per-example KL, summed over the batch
Sara
  • Not going to work! From the [documentation](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits): "WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. *Do not call this op with the output of softmax, as it will produce incorrect results*" (emphasis mine) – mikkola Mar 22 '18 at 06:39
  • 1
    Assuming that you have access to logits a and b. This is not calling it on prob_a and prob_b. It is calling it on a and b. – Sara Nov 15 '18 at 22:36
0

I think this might work:

tf.reduce_sum(p * tf.log(p/q))

where p is my actual probability distribution and q is my approximate probability distribution.

Akshaya Natarajan
0

I used the function from this code (from this Medium post) to calculate the KL divergence of a diagonal Gaussian from the standard normal distribution, where sd is the log of the standard deviation and mn is the mean tensor.

latent_loss = -0.5 * tf.reduce_sum(1.0 + 2.0 * sd - tf.square(mn) - tf.exp(2.0 * sd), 1)
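
As a quick sanity check of that formula (assuming sd holds the log of the standard deviation, which is what the exp(2.0 * sd) term implies, and mn the mean; the values below are arbitrary), it can be compared against the built-in Normal-vs-Normal KL:

import tensorflow as tf

mn = tf.constant([[0.5, -1.0]])   # means
sd = tf.constant([[0.1, 0.3]])    # log standard deviations
latent_loss = -0.5 * tf.reduce_sum(1.0 + 2.0 * sd - tf.square(mn) - tf.exp(2.0 * sd), 1)

p = tf.distributions.Normal(loc=mn, scale=tf.exp(sd))
q = tf.distributions.Normal(loc=tf.zeros_like(mn), scale=tf.ones_like(sd))
reference = tf.reduce_sum(tf.distributions.kl_divergence(p, q), 1)

with tf.Session() as sess:
    print(sess.run([latent_loss, reference]))  # the two values should agree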