I have this loss function:
loss_main = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(train_logits, train['labels']),
    name='loss_main',
)
train_logits is defined from a pipeline built as follows:
def build_logit_pipeline(data, include_dropout):
    # X --> *W1 --> +b1 --> relu --> *W2 --> +b2 ... --> softmax etc...
    pipeline = data

    for i in xrange(len(layer_sizes) - 1):
        last = i == len(layer_sizes) - 2
        with tf.name_scope("linear%d" % i):
            pipeline = tf.matmul(pipeline, weights[i])
            pipeline = tf.add(pipeline, biases[i])
        if not last:
            # insert relu after every one before the last
            with tf.name_scope("relu%d" % i):
                pipeline = getattr(tf.nn, arg('act-func'))(pipeline)
            if include_dropout and not arg('no-dropout'):
                pipeline = tf.nn.dropout(pipeline, 0.5, name='dropout')

    return pipeline
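train_logits then comes out of this builder roughly like so (the exact input tensor isn't shown here, so train['images'] is just a placeholder name):

# placeholder call -- the real input tensor isn't shown above
train_logits = build_logit_pipeline(train['images'], include_dropout=True)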
The layer_sizes, weights, and biases are constructed like so:
def make_weight(from_, to, name=None):
    return tf.Variable(tf.truncated_normal([from_, to], stddev=0.5), name=name)

def make_bias(to, name=None):
    return tf.Variable(tf.truncated_normal([to], stddev=0.5), name=name)

layer_sizes = [dataset.image_size**2] + arg('layers') + [dataset.num_classes]

with tf.name_scope("parameters"):
    with tf.name_scope("weights"):
        weights = [make_weight(layer_sizes[i], layer_sizes[i+1], name="weights_%d" % i)
                   for i in xrange(len(layer_sizes) - 1)]

    with tf.name_scope("biases"):
        biases = [make_bias(layer_sizes[i + 1], name="biases_%d" % i)
                  for i in xrange(len(layer_sizes) - 1)]
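So for example, with arg('layers') being [750], and assuming 28x28 images with 10 classes purely for illustration, the sizes work out like this:

layers = [750]                           # illustrative value for arg('layers')
layer_sizes = [28 ** 2] + layers + [10]  # 28x28 images, 10 classes -- example numbers only
print(layer_sizes)                       # [784, 750, 10]
# -> two weight matrices (784x750, 750x10) and two bias vectors (750, 10)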
If arg('act-func') is relu and I build a long chain of relus - say arg('layers') is [750, 750, 750, 750, 750, 750] - then my loss function is huge:
Global step: 0
Batch loss function: 28593700.000000
If I have a shorter chain of relus - say arg('layers') is only [750] - then the loss function is smaller:
Global step: 0
Batch loss function: 96.377831
My question is: why is the loss function so dramatically different? As I understand it, the logits are passed through a softmax to produce a probability distribution, and the cross entropy is then computed between that distribution and the one-hot labels. Why would changing the number of relus change this value? I figure each network should be about equally wrong at the start - essentially random - so the loss should never grow this large.
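For reference, here is the back-of-the-envelope number behind that expectation, using 10 classes purely for illustration: a roughly uniform softmax output should give a cross entropy of about log(num_classes) per example.

import numpy as np

num_classes = 10                                   # illustration only
uniform = np.full(num_classes, 1.0 / num_classes)  # what a "totally unsure" softmax would output
one_hot = np.zeros(num_classes)
one_hot[0] = 1.0
print(-np.sum(one_hot * np.log(uniform)))          # log(10) ~= 2.3 per example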
Note that this loss function does not contain any L2 loss, so the increased number of weights and biases cannot account for this.
Using arg('act-func') as tanh instead, this increase in loss does not occur - it stays about the same, as I would expect.
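Out of curiosity, I also put together a rough NumPy-only mock-up of the same stack (stand-in sizes, a plain normal instead of a truncated normal, and none of my actual graph code) just to see the scale of the numbers feeding the softmax at initialization:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1, 784)                              # one fake flattened image, values in [0, 1]
sizes = [784] + [750] * 6 + [10]                  # mirrors layer_sizes for the deep case
for i in range(len(sizes) - 1):
    w = rng.normal(0.0, 0.5, size=(sizes[i], sizes[i + 1]))
    b = rng.normal(0.0, 0.5, size=sizes[i + 1])
    x = x.dot(w) + b
    if i != len(sizes) - 2:
        x = np.maximum(x, 0.0)                    # relu after every layer but the last
print(np.abs(x).max())                            # the raw "logits" come out enormous

Swapping np.maximum(x, 0.0) for np.tanh(x) in that loop keeps the numbers small, which lines up with what I see when I use tanh in the real graph.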