Cost and activation functions for multiple independent labels

Question

After completing the mnist/cifar tutorials I thought I'd experiment with tensorflow by making my own 'large' set of data and for simplicity's sake, I settled on a black-on-white oval shape that changes its height and width independently on a 0.0-1.0 scale as a 28x28 pixel image (of which I have 5000 training images, 1000 test images).

My code uses the 'MNIST expert' tutorial as a basis (scaled back for speed) but I switched in a squared-error based cost function and, based on advice from here, swapped in a sigmoid function for the final activation layer, given that this isn't a classification but rather a 'best-fit' between two tensors, y_ and y_conv.

However over the course of >100k iterations, the loss output quickly settles into an oscilation between 400 and 900 (or, consequently, 0.2-0.3 of any given label averaged over 2 labels in a batch of 50) so I imagine I'm just getting noise. Perhaps I'm mistaken but I was hoping to use Tensorflow to convolve images so as to deduce maybe 10 or more independent labelled variables. Am I missing something fundamental here?

def train(images, labels):

# Import data
oval = blender_input_data.read_data_sets(images, labels)

sess = tf.InteractiveSession()

# Establish placeholders
x = tf.placeholder("float", shape=[None, 28, 28, 1])
tf.image_summary('images', x)
y_ = tf.placeholder("float", shape=[None, 2])

# Functions for Weight Initialization.

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

# Functions for convolution and pooling

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

# First Variables

W_conv1 = weight_variable([5, 5, 1, 16])
b_conv1 = bias_variable([16])

# First Convolutional Layer.
h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
_ = tf.histogram_summary('weights 1', W_conv1)
_ = tf.histogram_summary('biases 1', b_conv1)

# Second Variables
W_conv2 = weight_variable([5, 5, 16, 32])
b_conv2 = bias_variable([32])

# Second Convolutional Layer
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
_ = tf.histogram_summary('weights 2', W_conv2)
_ = tf.histogram_summary('biases 2', b_conv2)

# Fully connected Variables
W_fc1 = weight_variable([7 * 7 * 32, 512])
b_fc1 = bias_variable([512])

# Fully connected Layer
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*32])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1)+b_fc1)
_ = tf.histogram_summary('weights 3', W_fc1)
_ = tf.histogram_summary('biases 3', b_fc1)

# Drop out to reduce overfitting
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Readout layer with sigmoid activation function.
W_fc2 = weight_variable([512, 2])
b_fc2 = bias_variable([2])

with tf.name_scope('Wx_b'):
    y_conv=tf.sigmoid(tf.matmul(h_fc1_drop, W_fc2)+b_fc2)
    _ = tf.histogram_summary('weights 4', W_fc2)
    _ = tf.histogram_summary('biases 4', b_fc2)
    _ = tf.histogram_summary('y', y_conv)

# Loss with squared errors
with tf.name_scope('diff'):
    error = tf.reduce_sum(tf.abs(tf.sub(y_,y_conv)))
    diff = (error*error)
    _ = tf.scalar_summary('diff', diff)

# Train
with tf.name_scope('train'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(diff)

# Merge summaries and write them out.
merged = tf.merge_all_summaries()
writer = tf.train.SummaryWriter('/home/user/TBlogs/oval_logs', sess.graph_def)

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Launch the session.
sess.run(tf.initialize_all_variables())

# Restore variables from disk.
saver.restore(sess, "/home/user/TBlogs/model.ckpt")


for i in range(100000):

    batch = oval.train.next_batch(50)
    t_batch = oval.test.next_batch(50)

    if i%10 == 0:
        feed = {x:t_batch[0], y_: t_batch[1], keep_prob: 1.0}
        result = sess.run([merged, diff], feed_dict=feed)
        summary_str = result[0]
        df = result[1]

        writer.add_summary(summary_str, i)
        print('Difference:%s' % (df)
    else:
        feed = {x:batch[0], y_: batch[1], keep_prob: 0.5}
        sess.run(train_step, feed_dict=feed)

    if i%1000 == 0:
        save_path = saver.save(sess, "/home/user/TBlogs/model.ckpt")

# Completion
print("Session Done")

I'm most concerned over how tensor board seems to show that the weights are barely changing whatsoever, even after hours and hours of training and a decaying learning rate (although not shown in the code). My understanding of machine learning is that, when convolving images, the layers effectively amounts to layers of edge detection.....so I'm confused as to why they should barely change.

My theories currently are:
1. I've overlooked/misunderstood something regarding the loss function.
2. I've misunderstood how weights are initialized/updated
3. I've grossly underestimated how long the process should take...although, again, the loss seems to simply be oscillating.

Any help would be greatly appreciated, thanks!

score 0 · Answer 1 · answered Mar 11 '16 at 00:18

From what I can see, your cost function is not the usual mean square error.
You are optimizing tf.reduce_sum(tf.abs(tf.sub(y_,y_conv))) squared. This function is not differentiable in 0 (it is the square of the l1 norm). This can cause some stability issues (especially in the back propagation steps, I don't know what kind of sub gradient they use in this case).

The usual mean square error can be written as

residual = tf.sub(y_, y_conv)
error = tf.reduce_mean(tf.reduce_sum(residual*residual, reduction_indices=[1]))

(the use of mean and sum is to avoid having value dependent on the batch size). This is differentiable and should give you better behavior.

Cost and activation functions for multiple independent labels

1 Answers1