I have this loss function:
loss_main = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(train_logits, train['labels']),
    name='loss_main',
)
train_logits is defined from a pipeline built as follows:
def build_logit_pipeline(data, include_dropout):
    # X --> *W1 --> +b1 --> relu --> *W2 --> +b2 ... --> softmax etc...
    pipeline = data

    for i in xrange(len(layer_sizes) - 1):
        last = i == len(layer_sizes) - 2
        with tf.name_scope("linear%d" % i):
            pipeline = tf.matmul(pipeline, weights[i])
            pipeline = tf.add(pipeline, biases[i])
        if not last:
            # insert relu after every one before the last
            with tf.name_scope("relu%d" % i):
                pipeline = getattr(tf.nn, arg('act-func'))(pipeline)
            if include_dropout and not arg('no-dropout'):
                pipeline = tf.nn.dropout(pipeline, 0.5, name='dropout')

    return pipeline
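train_logits then comes out of this builder roughly like so (the exact input tensor isn't shown here, so train['images'] is just a placeholder name):

# placeholder call -- the real input tensor isn't shown above
train_logits = build_logit_pipeline(train['images'], include_dropout=True)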
The layer_sizes, weights, and biases are constructed like so:
def make_weight(from_, to, name=None):
    return tf.Variable(tf.truncated_normal([from_, to], stddev=0.5), name=name)

def make_bias(to, name=None):
    return tf.Variable(tf.truncated_normal([to], stddev=0.5), name=name)

layer_sizes = [dataset.image_size**2] + arg('layers') + [dataset.num_classes]

with tf.name_scope("parameters"):
    with tf.name_scope("weights"):
        weights = [make_weight(layer_sizes[i], layer_sizes[i+1], name="weights_%d" % i)
                   for i in xrange(len(layer_sizes) - 1)]

    with tf.name_scope("biases"):
        biases = [make_bias(layer_sizes[i + 1], name="biases_%d" % i)
                  for i in xrange(len(layer_sizes) - 1)]
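So for example, with arg('layers') being [750], and assuming 28x28 images with 10 classes purely for illustration, the sizes work out like this:

layers = [750]                           # illustrative value for arg('layers')
layer_sizes = [28 ** 2] + layers + [10]  # 28x28 images, 10 classes -- example numbers only
print(layer_sizes)                       # [784, 750, 10]
# -> two weight matrices (784x750, 750x10) and two bias vectors (750, 10)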
If arg('act-func') is relu and I build a long chain of relus - say arg('layers') is [750, 750, 750, 750, 750, 750] - then my loss function is huge:
Global step: 0
Batch loss function: 28593700.000000
If I have a shorter chain of relus - say arg('layers') is only [750] - then the loss function is smaller:
Global step: 0
Batch loss function: 96.377831
My question is: why is the loss function so dramatically different? As I understand it, the logits are passed through a softmax to produce a probability distribution, and the cross entropy is then computed between that distribution and the one-hot labels. Why would changing the number of relus change this value? I figure each network should be about equally wrong at the start - essentially random - so the loss should never grow this large.
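For reference, here is the back-of-the-envelope number behind that expectation, using 10 classes purely for illustration: a roughly uniform softmax output should give a cross entropy of about log(num_classes) per example.

import numpy as np

num_classes = 10                                   # illustration only
uniform = np.full(num_classes, 1.0 / num_classes)  # what a "totally unsure" softmax would output
one_hot = np.zeros(num_classes)
one_hot[0] = 1.0
print(-np.sum(one_hot * np.log(uniform)))          # log(10) ~= 2.3 per example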
Note that this loss function does not contain any L2 loss, so the increased number of weights and biases cannot account for this.
Using arg('act-func') as tanh instead, this increase in loss does not occur - it stays about the same, as I would expect.
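Out of curiosity, I also put together a rough NumPy-only mock-up of the same stack (stand-in sizes, a plain normal instead of a truncated normal, and none of my actual graph code) just to see the scale of the numbers feeding the softmax at initialization:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1, 784)                              # one fake flattened image, values in [0, 1]
sizes = [784] + [750] * 6 + [10]                  # mirrors layer_sizes for the deep case
for i in range(len(sizes) - 1):
    w = rng.normal(0.0, 0.5, size=(sizes[i], sizes[i + 1]))
    b = rng.normal(0.0, 0.5, size=sizes[i + 1])
    x = x.dot(w) + b
    if i != len(sizes) - 2:
        x = np.maximum(x, 0.0)                    # relu after every layer but the last
print(np.abs(x).max())                            # the raw "logits" come out enormous

Swapping np.maximum(x, 0.0) for np.tanh(x) in that loop keeps the numbers small, which lines up with what I see when I use tanh in the real graph.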