15

I have two tensors, prob_a and prob_b with shape [None, 1000], and I want to compute the KL divergence from prob_a to prob_b. Is there a built-in function for this in TensorFlow? I tried using tf.contrib.distributions.kl(prob_a, prob_b), but it gives:

NotImplementedError: No KL(dist_a || dist_b) registered for dist_a type Tensor and dist_b type Tensor

If there is no built-in function, what would be a good workaround?

nbro
Transcendental

7 Answers

12

Assuming that your input tensors prob_a and prob_b are probability tensors that sum to 1 along the last axis, you could do it like this:

def kl(x, y):
    X = tf.distributions.Categorical(probs=x)
    Y = tf.distributions.Categorical(probs=y)
    return tf.distributions.kl_divergence(X, Y)

result = kl(prob_a, prob_b)

A simple example:

import numpy as np
import tensorflow as tf
a = np.array([[0.25, 0.1, 0.65], [0.8, 0.15, 0.05]])
b = np.array([[0.7, 0.2, 0.1], [0.15, 0.8, 0.05]])
sess = tf.Session()
print(kl(a, b).eval(session=sess))  # [0.88995184 1.08808468]

You would get the same result with

np.sum(a * np.log(a / b), axis=1) 

However, this implementation is a bit buggy (checked in TensorFlow 1.8.0).

If you have zero probabilities in a, e.g. if you try [0.8, 0.2, 0.0] instead of [0.8, 0.15, 0.05], you will get nan, even though by the Kullback-Leibler definition 0 * log(0 / b) should contribute zero.

To mitigate this, one should add some small numerical constant to the probabilities. It is also prudent to use tf.distributions.kl_divergence(X, Y, allow_nan_stats=False) to cause a runtime error in such situations.

Also, if there are some zeros in b, you will get inf values which won't be caught by the allow_nan_stats=False option, so those have to be handled as well.
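
For example, a minimal sketch of that mitigation (the helper name kl_safe and the eps value are my own choices for illustration, not part of the original answer):

def kl_safe(x, y, eps=1e-8):
    # Clip both distributions away from zero so log(x / y) stays finite,
    # then renormalize so each row still sums to 1.
    x = tf.clip_by_value(x, eps, 1.0)
    y = tf.clip_by_value(y, eps, 1.0)
    x = x / tf.reduce_sum(x, axis=-1, keepdims=True)
    y = y / tf.reduce_sum(y, axis=-1, keepdims=True)
    X = tf.distributions.Categorical(probs=x)
    Y = tf.distributions.Categorical(probs=y)
    return tf.distributions.kl_divergence(X, Y, allow_nan_stats=False)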

meferne
  • Your arrays `a` and `b` seem to sum to 1 on the last axis, not on the first – Luca Di Liello Aug 06 '19 at 19:43
  • Yes, it would have been better to say "along axis 1", or even better, the last axis. I meant axis 1 when I wrote "along the first axis", as there is also axis 0. I'll edit the answer. Thanks! – meferne Aug 07 '19 at 12:16
  • `AttributeError: module 'tensorflow' has no attribute 'distributions'` – jtlz2 Apr 08 '20 at 14:46
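
As the last comment notes, tf.distributions no longer exists in TensorFlow 2.x; roughly the same approach works through TensorFlow Probability instead (a sketch, assuming the tensorflow_probability package is installed):

import tensorflow as tf
import tensorflow_probability as tfp

def kl(x, y):
    # Same idea as above, but with the distributions that moved to TFP.
    X = tfp.distributions.Categorical(probs=x)
    Y = tfp.distributions.Categorical(probs=y)
    return tfp.distributions.kl_divergence(X, Y)
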
7

Since softmax_cross_entropy_with_logits exists, there is no need to optimize the KL divergence directly. When prob_a is constant, it differs from the cross entropy only by a constant:

KL(prob_a, prob_b)  
  = Sum(prob_a * log(prob_a/prob_b))  
  = Sum(prob_a * log(prob_a) - prob_a * log(prob_b))  
  = - Sum(prob_a * log(prob_b)) + Sum(prob_a * log(prob_a)) 
  = - Sum(prob_a * log(prob_b)) + const 
  = H(prob_a, prob_b) + const 

If prob_a is not constant, you can rewrite it as the difference of two entropy terms:

KL(prob_a, prob_b)  
  = Sum(prob_a * log(prob_a/prob_b))  
  = Sum(prob_a * log(prob_a) - prob_a * log(prob_b))  
  = - Sum(prob_a * log(prob_b)) + Sum(prob_a * log(prob_a)) 
  = H(prob_a, prob_b) - H(prob_a, prob_a)  
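
A rough sketch of that last identity with plain TensorFlow ops (the helper name kl_via_entropies is made up for illustration; zeros in either tensor still need the handling described in the answer above):

def kl_via_entropies(prob_a, prob_b):
    # KL(a || b) = H(a, b) - H(a, a), computed row-wise
    cross_entropy = -tf.reduce_sum(prob_a * tf.log(prob_b), axis=-1)  # H(a, b)
    entropy = -tf.reduce_sum(prob_a * tf.log(prob_a), axis=-1)        # H(a, a)
    return cross_entropy - entropy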
Jiecheng Zhao
5

I'm not sure why it's not implemented, but perhaps there is a workaround. The KL divergence is defined as:

KL(prob_a, prob_b) = Sum(prob_a * log(prob_a/prob_b))

The cross entropy H, on the other hand, is defined as:

H(prob_a, prob_b) = -Sum(prob_a * log(prob_b))

So, if you create a variable y = prob_a/prob_b, you could obtain the KL divergence by computing -H(prob_a, y). In TensorFlow notation, something like:

KL = tf.reduce_mean(-tf.nn.softmax_cross_entropy_with_logits(prob_a, y))

Transcendental
E.J. White
  • KL divergence must be 0 when `prob_a` = `prob_b`. But last line doesn't give 0. – Transcendental Jan 26 '17 at 00:33
  • Yes, it does. When `prob_a = prob_b`, we get `y = 1`. Then, `H(prob_a, y)` is zero from `log(y)`. Are you saying you checked it using Tensorflow's `softmax_cross_entropy_with_logits(prob_a, y)` and the result was not zero? – E.J. White Jan 27 '17 at 07:26
  • 1
    Exactly. TensorFlow's implementation might be slightly different than the actual formula. – Transcendental Jan 27 '17 at 11:17
  • 2
    Worth pointing out that softmax_cross_entropy_with_logits(prob_a,y) does not actually implement H(prob_a,y), it implements H(softmax(a),y). So using softmax_cross_entropy_with_logits will only work if you try to calculate the KL divergence on the activations of a softmax function (prob_a) and have access to the unscaled logits (a) – shapecatcher Aug 20 '19 at 13:17
2

tf.contrib.distributions.kl takes instances of a tf.distributions.Distribution, not a Tensor.

Example:

  ds = tf.contrib.distributions
  p = ds.Normal(loc=0., scale=1.)
  q = ds.Normal(loc=1., scale=2.)
  kl = ds.kl_divergence(p, q)
  # ==> 0.44314718
jvdillon
1

Assuming that you have access to logits a and b:

prob_a = tf.nn.softmax(a)
cr_aa = tf.nn.softmax_cross_entropy_with_logits(labels=prob_a, logits=a)  # H(prob_a, prob_a)
cr_ab = tf.nn.softmax_cross_entropy_with_logits(labels=prob_a, logits=b)  # H(prob_a, prob_b)
kl_ab = tf.reduce_sum(cr_ab - cr_aa)  # per-example KL, summed over the batch
Sara
  • Not going to work! From the [documentation](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits): "WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. *Do not call this op with the output of softmax, as it will produce incorrect results*" (emphasis mine) – mikkola Mar 22 '18 at 06:39
  • 1
    Assuming that you have access to logits a and b. This is not calling it on prob_a and prob_b. It is calling it on a and b. – Sara Nov 15 '18 at 22:36
0

I think this might work:

tf.reduce_sum(p * tf.log(p/q))

where p is my actual probability distribution and q is my approximate probability distribution.

Akshaya Natarajan
0

I used the function from this code (from this Medium post) to calculate the KL divergence of a diagonal Gaussian from the standard normal distribution, where sd is the log of the standard deviation and mn is the mean tensor.

latent_loss = -0.5 * tf.reduce_sum(1.0 + 2.0 * sd - tf.square(mn) - tf.exp(2.0 * sd), 1)
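
As a quick sanity check of that formula (assuming sd holds the log of the standard deviation, which is what the exp(2.0 * sd) term implies, and mn the mean; the values below are arbitrary), it can be compared against the built-in Normal-vs-Normal KL:

import tensorflow as tf

mn = tf.constant([[0.5, -1.0]])   # means
sd = tf.constant([[0.1, 0.3]])    # log standard deviations
latent_loss = -0.5 * tf.reduce_sum(1.0 + 2.0 * sd - tf.square(mn) - tf.exp(2.0 * sd), 1)

p = tf.distributions.Normal(loc=mn, scale=tf.exp(sd))
q = tf.distributions.Normal(loc=tf.zeros_like(mn), scale=tf.ones_like(sd))
reference = tf.reduce_sum(tf.distributions.kl_divergence(p, q), 1)

with tf.Session() as sess:
    print(sess.run([latent_loss, reference]))  # the two values should agree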