
I was looking at TensorFlow's basic neural network tutorial for beginners [1]. I am having trouble understanding the calculation of the cross-entropy value and how it's used. In the example a placeholder is created to hold the correct labels:

y_ = tf.placeholder(tf.float32, [None, 10])

and the cross-entropy, -sum(y_ * log(y)), is calculated as follows:

reduct = -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])
cross_entropy = tf.reduce_mean( reduct )

Looking at the dimensions, I assume we have (element-wise multiplication):

y_ * log(y) = [batch x classes ] x [batch x classes ]

y_ * log(y) = [batch x classes ]

And a quick check confirms this:

y_ * tf.log(y)
<tf.Tensor 'mul_8:0' shape=(?, 10) dtype=float32>
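
Following the shapes through the rest of the expression (a quick sketch of what I expect, reusing the tutorial's y and y_):

reduct = -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])   # shape (?,)  -- one value per batch member
cross_entropy = tf.reduce_mean(reduct)                           # shape ()    -- a single scalar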

Now here is what I don't understand. My understanding is that for cross-entropy we need to consider the distributions of y (predicted) and y_ (oracle). So I assume that we first need to take the reduce_mean of y and y_ over their columns (by class). I would then get two vectors of size:

y_ = [classes x 1 ]

y = [classes x 1 ]

Since y_ is the "correct" distribution, we then take (notice that in the example the vectors are flipped):

log(y_) = [ classes x 1 ]

And now we do an element-wise multiplication:

y x log(y_)

Which gives us a vector with the length of the classes. And finally we simply sum this vector to get a single value:

Hy(y_) = sum( y x log(y_) )

However, this does not seem to be the calculation that is being performed. Can anyone explain where my error is? Maybe point me to some page with a good explanation. In addition to this, we are using one-hot encoding, so log(1) = 0 and log(0) = -infinity, which will cause errors in the calculations. I understand that the optimizer will calculate the derivatives, but isn't the cross-entropy still calculated?

TIA.

[1] https://www.tensorflow.org/versions/r0.9/tutorials/mnist/beginners/index.html

user2051561

1 Answer


Most of what you described is correct. However:

My understanding is that for cross-entropy we need to consider the distributions of y (predicted) and y_ (oracle). So I assume that we first need to reduce_mean of the y and the y_ by their columns (by class).

First you need to build a vector with one value per batch member: compare y and y_ elementwise (in your case 10 elements per batch member) and sum over the classes. Then you reduce_mean that vector over the batch to get a single number. So the correct order of things is to compare y and y_ elementwise, reduce over the classes per example, then average over the batch.
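
To make the order concrete, here is a minimal NumPy sketch (made-up batch of 2 examples and 3 classes) that mirrors the reduce_sum/reduce_mean lines from the tutorial:

import numpy as np

# Made-up predicted distributions (rows sum to 1) and one-hot "oracle" labels.
y  = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]])
y_ = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])

# Step 1: compare elementwise, then sum over the classes (axis 1).
# This gives one cross-entropy value per batch member: [-log(0.7), -log(0.8)].
per_example = -np.sum(y_ * np.log(y), axis=1)

# Step 2: average over the batch to get a single number.
cross_entropy = np.mean(per_example)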

Regarding the log(0)-problem: That's why you often see

reduct = -tf.reduce_sum(y_ * tf.log(y + 1e-5), reduction_indices=[1])
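
An alternative you may also run into (a sketch, not from the tutorial itself) is clipping the predictions away from 0 before taking the log; newer code often sidesteps the issue entirely by applying tf.nn.softmax_cross_entropy_with_logits to the logits, which handles the numerical stability internally.

# Sketch: clip predictions into [1e-10, 1.0] so log() never sees an exact 0
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)),
                   reduction_indices=[1]))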
Phillip Bock
  • Appreciate the feedback. In regards to the log(0), I have not seen code like what you show above, but it makes sense. I have also read that under the covers TensorFlow deals with numerical stability issues, so I will assume this is already taken care of. As for the explanation, I still don't get it. Cross-entropy is about probability distributions. How can multiplying two matrices of "one-hot encoding" represent a distribution? I am missing something fundamental here. – user2051561 Aug 01 '16 at 15:50
  • Found what I was looking for: http://datascience.stackexchange.com/questions/9302/the-cross-entropy-error-function-in-neural-networks – user2051561 Aug 02 '16 at 13:38