
I have written the following binary classification program in TensorFlow, and it is buggy: the cost comes out as zero no matter what the input is. I am trying to debug a larger program that is not learning anything from the data, and I have narrowed at least one bug down to the cost function always returning zero. The program below uses random inputs and shows the same problem. In the original, self.X_train and self.y_train are read from files, and self.predict() has more layers, forming a feedforward neural network.

import numpy as np
import tensorflow as tf

class annClassifier():

    def __init__(self):

        with tf.variable_scope("Input"):
             self.X = tf.placeholder(tf.float32, shape=(100, 11))

        with tf.variable_scope("Output"):
            self.y = tf.placeholder(tf.float32, shape=(100, 1))

        self.X_train = np.random.rand(100, 11)
        self.y_train = np.random.randint(0,2, size=(100, 1))

    def predict(self):

        with tf.variable_scope('OutputLayer'):
            weights = tf.get_variable(name='weights',
                                      shape=[11, 1],
                                      initializer=tf.contrib.layers.xavier_initializer())
            bases = tf.get_variable(name='bases',
                                    shape=[1],
                                    initializer=tf.zeros_initializer())
            final_output = tf.matmul(self.X, weights) + bases

        return final_output

    def train(self):

        prediction = self.predict()
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=self.y))

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())         
            print(sess.run(cost, feed_dict={self.X:self.X_train, self.y:self.y_train}))


with tf.Graph().as_default():
    classifier = annClassifier()
    classifier.train()

If someone could please point out what I am doing wrong here, I can try making the same change in my original program. Thanks a lot!

Ananda

2 Answers


The only problem is that an invalid cost function is used. softmax_cross_entropy_with_logits should be used only if you have more than two classes, because the softmax of a single output always returns 1; it is defined as:

softmax(x)_i = exp(x_i) / SUM_j exp(x_j)

so for a single number (a one-dimensional output)

softmax(x) = exp(x) / exp(x) = 1

Furthermore, for softmax output TF expects one-hot encoded labels, so if you provide only 0 or 1, there are two possibilities:

  1. True label is 0, so the cost is -0*log(1) = 0
  2. True label is 1, so the cost is -1*log(1) = 0
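A quick numeric check of these two cases, in plain NumPy rather than TF (the example logit value 2.7 is arbitrary):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))       # shifted for numerical stability
    return e / e.sum()

logit = np.array([2.7])             # any single raw network output
p = softmax(logit)                  # a one-element softmax is always [1.0]

for label in (0.0, 1.0):            # the only two labels the asker feeds in
    cost = -label * np.log(p[0])    # -0*log(1) or -1*log(1): zero either way
    print(label, cost == 0.0)       # True for both labels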

TensorFlow has a separate function to handle binary classification, which applies a sigmoid instead (note that the same function applied to more than one output would apply the sigmoid independently on each dimension, which is what multi-label classification expects):

tf.nn.sigmoid_cross_entropy_with_logits

Just switch to this cost and you are good to go. You do not have to one-hot encode anything either, as this function is designed exactly for your use case.
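For example, the cost line in the posted train() could become something like the following minimal sketch (keeping the reduce_mean over the batch as in the original):

# drop-in replacement for the softmax-based cost line
cost = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=prediction, labels=self.y))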

The only missing bit is that your code does not have an actual training routine: you need to define an optimiser, ask it to minimise the loss, and then run the train op in a loop. In your current setup you just predict over and over with a network that never changes.
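A rough sketch of what such a training routine could look like for the posted class; the optimiser choice, learning rate and number of epochs are illustrative, not part of the original code:

def train(self, epochs=100, learning_rate=0.01):
    prediction = self.predict()
    cost = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=prediction,
                                                labels=self.y))
    # plain gradient descent as an example; any tf.train optimiser works
    train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        feed = {self.X: self.X_train, self.y: self.y_train}
        for _ in range(epochs):
            _, c = sess.run([train_op, cost], feed_dict=feed)
        print("final cost:", c)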

In particular, please refer to the Cross Entropy Jungle question on SO, which gives a more detailed description of all these different helper functions in TF (and other libraries) and their different requirements/use cases.

lejlot
  • sigmoid loss measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. This loss won't be effective for a binary classification problem. – Ishant Mrinal Aug 12 '17 at 22:59
  • @Ishant, you seem to be confusing terms - for a **binary classification** there are just two options, either you are a member of class 1 or 2, this is why you can model **one** probability with sigmoid. P(y=1|x) = sigmoid(f(x)), as by the very definition of probability P(y=2|x) = 1 - P(y=1|x). There is no problem here with mutual exclusiveness. However, if you would have more than 2 classes then you are right, sigmoid cannot be applied. This is exactly how (among other models) logistic regression is derived. – lejlot Aug 12 '17 at 23:01
  • 1
    The harm is not big, but it is not "no harm" - you are allocating more memory (does not matter much if the last layer is small, but could matter otherwise), you are wasting computations (as there is nothing that the learning can benefit from, yet we have to compute additional gradient), finally for the neural networks it is not clear if this redundancy will not affect learning dynamics (since we still have very little understanding of loss surfaces of deep nets), thus it is a better strategy to avoid unnecessary complexity. – lejlot Aug 12 '17 at 23:27
  • Thank you. This is not the actual program that I am trying to debug; I was just trying to figure out why the cost was returning zero. In the actual version the cost was stuck at a constant zero over the iterations and nothing was being learned. I figured it was better to post just the part that produces the constant cost, so as to make it less cluttered. – Ananda Aug 13 '17 at 08:05

softmax_cross_entropy_with_logits is basically a numerically stable implementation of these two steps:

softmax = tf.nn.softmax(prediction)                   # per-class probabilities
cost = -tf.reduce_sum(labels * tf.log(softmax), 1)    # per-example cross entropy

Now in your example, prediction is a single value, so when you apply softmax to it, the result is always 1 irrespective of the value (exp(prediction)/exp(prediction) = 1), and so the tf.log(softmax) term becomes 0. That is why you always get a cost of zero.

Either apply a sigmoid to get your probability between 0 and 1, or, if you want to use softmax, encode the labels as [1, 0] for class 0 and [0, 1] for class 1.
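A sketch of both options against the posted model; the names cost_sigmoid, cost_softmax and onehot_labels are illustrative, and option 2 assumes the output layer is widened to two units:

# Option 1: keep the single logit and the 0/1 labels, use the sigmoid loss
cost_sigmoid = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=prediction, labels=self.y))

# Option 2: one-hot encode the labels and keep the softmax loss; this only
# makes sense if the output layer has two units (weights of shape [11, 2]),
# which is a change to the posted code, shown here only for illustration
onehot_labels = tf.one_hot(tf.cast(tf.squeeze(self.y), tf.int32), depth=2)
cost_softmax = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=prediction,
                                            labels=onehot_labels))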

Vijay Mariappan