
I'm having trouble setting up a multilayer perceptron for binary classification using TensorFlow.

I have a very large dataset (about 1.5×10^6 examples), each with a binary (0/1) label and 100 features. What I need to do is set up a simple MLP and then vary the learning rate and the initialization pattern, documenting the results (it's an assignment). I am getting strange results, though: my MLP seems to get stuck at a low-but-not-great cost early on and never escapes it. With fairly low values of the learning rate, the cost goes to NaN almost immediately. I don't know whether the problem lies in how I structured the MLP (I made a few attempts; I'm posting the code for the last one) or whether I'm missing something in my TensorFlow implementation.

CODE

import tensorflow as tf
import numpy as np
import scipy.io

# Import and transform dataset
print("Importing dataset.")
dataset = scipy.io.mmread('tfidf_tsvd.mtx')

with open('labels.txt') as f:
    all_labels = f.readlines()

all_labels = np.asarray(all_labels, dtype=np.float64)  # parse the '0'/'1' text labels as floats
all_labels = all_labels.reshape((1498271,1))

# Split dataset into training (66%) and test (33%) set
training_set    = dataset[0:1000000]
training_labels = all_labels[0:1000000]
test_set        = dataset[1000000:1498272]
test_labels     = all_labels[1000000:1498272]

print("Dataset ready.") 

# Parameters
learning_rate   = 0.01 #argv
mini_batch_size = 100
training_epochs = 10000
display_step    = 500

# Network Parameters
n_hidden_1  = 64    # 1st hidden layer of neurons
n_hidden_2  = 32    # 2nd hidden layer of neurons
n_hidden_3  = 16    # 3rd hidden layer of neurons
n_input     = 100   # number of features after LSA

# Tensorflow Graph input
x = tf.placeholder(tf.float64, shape=[None, n_input], name="x-data")
y = tf.placeholder(tf.float64, shape=[None, 1], name="y-labels")

print("Creating model.")

# Create model
def multilayer_perceptron(x, weights):
    # First hidden layer with SIGMOID activation
    layer_1 = tf.matmul(x, weights['h1'])
    layer_1 = tf.nn.sigmoid(layer_1)
    # Second hidden layer with SIGMOID activation
    layer_2 = tf.matmul(layer_1, weights['h2'])
    layer_2 = tf.nn.sigmoid(layer_2)
    # Third hidden layer with SIGMOID activation
    layer_3 = tf.matmul(layer_2, weights['h3'])
    layer_3 = tf.nn.sigmoid(layer_3)
    # Output layer, linear (no activation applied here)
    out_layer = tf.matmul(layer_2, weights['out'])
    return out_layer

# Layer weights, should change them to see results
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], dtype=np.float64)),       
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], dtype=np.float64)),
    'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3],dtype=np.float64)),
    'out': tf.Variable(tf.random_normal([n_hidden_2, 1], dtype=np.float64))
}

# Construct model
pred = multilayer_perceptron(x, weights)

# Define loss and optimizer
cost = tf.nn.l2_loss(pred-y,name="squared_error_cost")
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.initialize_all_variables()

print("Model ready.")

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    print("Starting Training.")

    # Training cycle
    for epoch in range(training_epochs):
        #avg_cost = 0.
        # minibatch loading
        minibatch_x = training_set[mini_batch_size*epoch:mini_batch_size*(epoch+1)]
        minibatch_y = training_labels[mini_batch_size*epoch:mini_batch_size*(epoch+1)]
        # Run optimization op (backprop) and cost op
        _, c = sess.run([optimizer, cost], feed_dict={x: minibatch_x, y: minibatch_y})

        # Compute average loss
        avg_cost = c / (minibatch_x.shape[0])

        # Display logs per epoch
        if epoch % display_step == 0:
            print("Epoch:", '%05d' % (epoch), "Training error=", "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

    # Test model
    # Calculate accuracy
    test_error = tf.nn.l2_loss(pred-y,name="squared_error_test_cost")/test_set.shape[0]
    print("Test Error:", test_error.eval({x: test_set, y: test_labels}))

OUTPUT

python nn.py
Importing dataset.
Dataset ready.
Creating model.
Model ready.
Starting Training.
Epoch: 00000 Training error= 0.331874878
Epoch: 00500 Training error= 0.121587482
Epoch: 01000 Training error= 0.112870921
Epoch: 01500 Training error= 0.110293652
Epoch: 02000 Training error= 0.122655269
Epoch: 02500 Training error= 0.124971940
Epoch: 03000 Training error= 0.125407845
Epoch: 03500 Training error= 0.131942481
Epoch: 04000 Training error= 0.121696954
Epoch: 04500 Training error= 0.116669835
Epoch: 05000 Training error= 0.129558477
Epoch: 05500 Training error= 0.122952110
Epoch: 06000 Training error= 0.124655344
Epoch: 06500 Training error= 0.119827300
Epoch: 07000 Training error= 0.125183779
Epoch: 07500 Training error= 0.156429254
Epoch: 08000 Training error= 0.085632880
Epoch: 08500 Training error= 0.133913128
Epoch: 09000 Training error= 0.114762624
Epoch: 09500 Training error= 0.115107805
Optimization Finished!
Test Error: 0.116647016708

This is what MMN advised

weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], stddev=0.01, dtype=np.float64)),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], stddev=0.01, dtype=np.float64)),
    'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3],  stddev=0.01, dtype=np.float64)),
    'out': tf.Variable(tf.random_normal([n_hidden_2, 1], dtype=np.float64))
}

This is the output

Epoch: 00000 Training error= 0.107566668
Epoch: 00500 Training error= 0.289380907
Epoch: 01000 Training error= 0.339091784
Epoch: 01500 Training error= 0.358559815
Epoch: 02000 Training error= 0.122639698
Epoch: 02500 Training error= 0.125160135
Epoch: 03000 Training error= 0.126219718
Epoch: 03500 Training error= 0.132500418
Epoch: 04000 Training error= 0.121795254
Epoch: 04500 Training error= 0.116499476
Epoch: 05000 Training error= 0.124532673
Epoch: 05500 Training error= 0.124484790
Epoch: 06000 Training error= 0.118491177
Epoch: 06500 Training error= 0.119977633
Epoch: 07000 Training error= 0.127532511
Epoch: 07500 Training error= 0.159053519
Epoch: 08000 Training error= 0.083876224
Epoch: 08500 Training error= 0.131488483
Epoch: 09000 Training error= 0.123161189
Epoch: 09500 Training error= 0.125011362
Optimization Finished!
Test Error: 0.129284643093

Connected third hidden layer, thanks to MMN

There was a mistake in my code: I had two hidden layers instead of three. I corrected it by changing:

'out': tf.Variable(tf.random_normal([n_hidden_3, 1], dtype=np.float64))

and

out_layer = tf.matmul(layer_3, weights['out'])

I returned to the old value for stddev, though, as it seems to cause less fluctuation in the cost function.

The output is still troubling

Epoch: 00000 Training error= 0.477673073
Epoch: 00500 Training error= 0.121848744
Epoch: 01000 Training error= 0.112854530
Epoch: 01500 Training error= 0.110597624
Epoch: 02000 Training error= 0.122603499
Epoch: 02500 Training error= 0.125051472
Epoch: 03000 Training error= 0.125400717
Epoch: 03500 Training error= 0.131999354
Epoch: 04000 Training error= 0.121850889
Epoch: 04500 Training error= 0.116551533
Epoch: 05000 Training error= 0.129749704
Epoch: 05500 Training error= 0.124600464
Epoch: 06000 Training error= 0.121600218
Epoch: 06500 Training error= 0.121249676
Epoch: 07000 Training error= 0.132656938
Epoch: 07500 Training error= 0.161801757
Epoch: 08000 Training error= 0.084197352
Epoch: 08500 Training error= 0.132197409
Epoch: 09000 Training error= 0.123249055
Epoch: 09500 Training error= 0.126602369
Optimization Finished!
Test Error: 0.129230736355

Two more changes, thanks to Steven. Steven proposed replacing the sigmoid activation function with ReLU, so I tried that. In the meantime, I noticed I hadn't set an activation function for the output node, so I did that too (it should be easy to see what I changed).
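To make the change explicit, the forward pass now looks roughly like this (a sketch of the change described above, not the exact code; ReLU on every node including the output, to be compared with the all-sigmoid variant further down):

def multilayer_perceptron(x, weights):
    # Hidden layers now use ReLU instead of sigmoid
    layer_1 = tf.nn.relu(tf.matmul(x, weights['h1']))
    layer_2 = tf.nn.relu(tf.matmul(layer_1, weights['h2']))
    layer_3 = tf.nn.relu(tf.matmul(layer_2, weights['h3']))
    # Output node now gets an activation too (previously it had none)
    out_layer = tf.nn.relu(tf.matmul(layer_3, weights['out']))
    return out_layer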

Starting Training.
Epoch: 00000 Training error= 293.245977809
Epoch: 00500 Training error= 0.290000000
Epoch: 01000 Training error= 0.340000000
Epoch: 01500 Training error= 0.360000000
Epoch: 02000 Training error= 0.285000000
Epoch: 02500 Training error= 0.250000000
Epoch: 03000 Training error= 0.245000000
Epoch: 03500 Training error= 0.260000000
Epoch: 04000 Training error= 0.290000000
Epoch: 04500 Training error= 0.315000000
Epoch: 05000 Training error= 0.285000000
Epoch: 05500 Training error= 0.265000000
Epoch: 06000 Training error= 0.340000000
Epoch: 06500 Training error= 0.180000000
Epoch: 07000 Training error= 0.370000000
Epoch: 07500 Training error= 0.175000000
Epoch: 08000 Training error= 0.105000000
Epoch: 08500 Training error= 0.295000000
Epoch: 09000 Training error= 0.280000000
Epoch: 09500 Training error= 0.285000000
Optimization Finished!
Test Error: 0.220196439287

This is what it does with the Sigmoid activation function on every node, output included

Epoch: 00000 Training error= 0.110878121
Epoch: 00500 Training error= 0.119393080
Epoch: 01000 Training error= 0.109229532
Epoch: 01500 Training error= 0.100436962
Epoch: 02000 Training error= 0.113160662
Epoch: 02500 Training error= 0.114200962
Epoch: 03000 Training error= 0.109777990
Epoch: 03500 Training error= 0.108218725
Epoch: 04000 Training error= 0.103001394
Epoch: 04500 Training error= 0.084145737
Epoch: 05000 Training error= 0.119173495
Epoch: 05500 Training error= 0.095796251
Epoch: 06000 Training error= 0.093336573
Epoch: 06500 Training error= 0.085062860
Epoch: 07000 Training error= 0.104251661
Epoch: 07500 Training error= 0.105910949
Epoch: 08000 Training error= 0.090347288
Epoch: 08500 Training error= 0.124480612
Epoch: 09000 Training error= 0.109250224
Epoch: 09500 Training error= 0.100245836
Optimization Finished!
Test Error: 0.110234139674

I find these numbers very strange. In the first case, it gets stuck at a higher cost than with sigmoid, even though sigmoid should saturate very early. In the second case, it starts with a training error that is almost equal to the final one, so it basically converges within one mini-batch. I'm starting to think that I am not calculating the cost correctly, in this line: avg_cost = c / (minibatch_x.shape[0])
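For reference, tf.nn.l2_loss(t) computes sum(t**2) / 2, so dividing it by the batch size gives half the per-example mean squared error. A quick standalone check of the arithmetic with made-up numbers (purely illustrative):

import numpy as np

pred = np.array([0.9, 0.2, 0.7])   # hypothetical network outputs
y    = np.array([1.0, 0.0, 1.0])   # hypothetical labels

l2 = np.sum((pred - y) ** 2) / 2.0   # what tf.nn.l2_loss(pred - y) returns
avg_cost = l2 / pred.shape[0]        # the normalization used in the question
print(avg_cost)                      # 0.0233... = 0.5 * MSE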

Darkobra

3 Answers


So it could be a couple of things:

  1. You could be saturating the sigmoid units (as MMN mentioned). I would suggest trying ReLU units instead.

replace:

tf.nn.sigmoid(layer_n)

with:

tf.nn.relu(layer_n)
  2. Your model may not have the expressive power to actually learn your data, i.e. it would need to be deeper.
  3. You can also try a different optimizer, like Adam, as such:

replace:

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

with:

optimizer = tf.train.AdamOptimizer().minimize(cost)

A few other points:

  1. You should add a bias term to your weights

like so:

biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1], dtype=np.float64)),
    'b2': tf.Variable(tf.random_normal([n_hidden_2], dtype=np.float64)),
    'b3': tf.Variable(tf.random_normal([n_hidden_3], dtype=np.float64)),
    'bout': tf.Variable(tf.random_normal([1], dtype=np.float64))
}

def multilayer_perceptron(x, weights, biases):
    # First hidden layer with SIGMOID activation
    layer_1 = tf.matmul(x, weights['h1']) + biases['b1']
    layer_1 = tf.nn.sigmoid(layer_1)
    # Second hidden layer with SIGMOID activation
    layer_2 = tf.matmul(layer_1, weights['h2']) + biases['b2']
    layer_2 = tf.nn.sigmoid(layer_2)
    # Third hidden layer with SIGMOID activation
    layer_3 = tf.matmul(layer_2, weights['h3']) + biases['b3']
    layer_3 = tf.nn.sigmoid(layer_3)
    # Output layer, linear (no activation applied here)
    out_layer = tf.matmul(layer_3, weights['out']) + biases['bout']
    return out_layer
  2. And you can update the learning rate over time

like so:

learning_rate = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                           global_step,
                                           decay_steps,
                                           LEARNING_RATE_DECAY_FACTOR,
                                           staircase=True)

You just need to define decay_steps, i.e. when to decay, and LEARNING_RATE_DECAY_FACTOR, i.e. by how much to decay.
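A minimal sketch of how this wires into the rest of the graph, assuming the TF 0.x/1.x API and made-up values for the schedule: global_step must be a non-trainable variable, and passing it to minimize() makes the optimizer increment it on every step so the decay actually advances.

global_step = tf.Variable(0, trainable=False)  # incremented by the optimizer

learning_rate = tf.train.exponential_decay(
    0.01,           # INITIAL_LEARNING_RATE (assumed value)
    global_step,
    10000,          # decay_steps: decay every 10k steps (assumed)
    0.96,           # LEARNING_RATE_DECAY_FACTOR (assumed)
    staircase=True)

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    cost, global_step=global_step)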

Steven
  • I've edited the answer with your proposals. Noting that: 1. relu gives very strange values, you can read it in the edit of the question. 2. I've made the model deeper, as it had 2 hidden layers before due to a mistake of mine and it has 3 hidden layers now. 3. I really can't use Adam optimizer as it would go against the purpose of my assignment, which is to play with the learning rate and a few initialization parameters. Do you think I am calculating the cost correctly, after every mini_batch? – Darkobra Oct 02 '16 at 17:02
  • There are different cost functions, so it really depends on your task. I can't really answer that without knowing the assignment, i.e. whether L2 loss is appropriate, or cross-entropy, or something else. You are using L2 loss correctly, though. – Steven Oct 02 '16 at 18:59
  • One other simple thing that's "obvious" but sometimes goes unnoticed: make sure your labels correspond to the correct training inputs. – Steven Oct 02 '16 at 19:08
  • Yes, they do. I checked, thanks. Your answer helped me a lot, though, so I think I'll mark it as the correct one. Playing with the activation function turned out to be the solution. I am now playing with stochastic gradient descent, batch gradient descent, biases and other things. Thanks! I'd like to thank MMN too, as he too helped me a lot. – Darkobra Oct 04 '16 at 11:59

Your weights are initialized with a stddev of 1, so the pre-activations of layer 1 will have a stddev of 10 or so (with 100 unit-variance inputs, a sum of 100 such products has stddev √100 = 10). This might be saturating the sigmoid functions to the point where most gradients are 0.

Can you try initializing the hidden weights with a stddev of .01?
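A quick numpy sanity check of that claim, using made-up standard-normal inputs (hypothetical data, not the question's dataset):

import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 100)             # 100 unit-variance input features
w = np.random.randn(100, 64)               # weights with stddev 1, as in the question
print(np.std(x.dot(w)))                    # ~10: deep in sigmoid's saturated region

w_small = np.random.randn(100, 64) * 0.01  # weights with stddev 0.01, as advised
print(np.std(x.dot(w_small)))              # ~0.1: inside sigmoid's near-linear range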

MMN
  • Looks like this 00000 Tr err= 0.107566 00500 Tr err= 0.289380 01000 Tr err= 0.339091 01500 Tr err= 0.358559 02000 Tr err= 0.122639 02500 Tr err= 0.125160 03000 Tr err= 0.126219 03500 Tr err= 0.132500 04000 Tr err= 0.121795 04500 Tr err= 0.116499 05000 Tr err= 0.124532 05500 Tr err= 0.124484 06000 Tr err= 0.118491 06500 Tr err= 0.119977 07000 Tr err= 0.127532 07500 Tr err= 0.159053 08000 Tr err= 0.083876 08500 Tr err= 0.131488 09000 Tr err= 0.123161 09500 Tr err= 0.125011 Te Err: 0.129284643 – Darkobra Oct 02 '16 at 15:45
  • Uhm, can't give proper shape to comments, yet I can tell you this didn't solve my problem. – Darkobra Oct 02 '16 at 15:51
  • hmm, maybe that's the best you are going to get with a two-layer network? Did you mean to not use h3? – MMN Oct 02 '16 at 16:04
  • Ouch. I haven't connected layer_3. I'll try now. – Darkobra Oct 02 '16 at 16:06
  • Still no luck, I've edited the question with the new results. – Darkobra Oct 02 '16 at 16:42

Along with the above answers, I suggest you try the cost function tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None).

Since this is binary classification, sigmoid_cross_entropy_with_logits is the cost function to try.
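A minimal sketch of how this would replace the question's cost, assuming the 2016-era positional signature quoted above; note that pred must then stay a raw linear output (logits), with no sigmoid applied to the output layer:

# Replaces: cost = tf.nn.l2_loss(pred-y, name="squared_error_cost")
# pred must be logits; do not apply sigmoid to it beforehand.
cost = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(pred, y),
    name="xent_cost")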

I also suggest plotting the training and test accuracy against the number of epochs, i.e. checking whether the model is overfitting.

If it's not overfitting, try making your neural net more complex by increasing the number of neurons and the number of layers. You will reach a point beyond which training accuracy keeps increasing while validation accuracy does not; that point gives the best model.
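A minimal sketch of the suggested plot, with hypothetical accuracy values standing in for numbers that would be collected once per display_step inside the training loop:

import matplotlib.pyplot as plt

# Hypothetical per-epoch accuracies, purely for illustration
train_acc = [0.60, 0.72, 0.80, 0.86, 0.90, 0.93]
test_acc  = [0.58, 0.70, 0.77, 0.80, 0.80, 0.79]  # flattens out: overfitting begins

plt.plot(train_acc, label="train accuracy")
plt.plot(test_acc, label="test accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()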

Pramod Patil
  • Hey Pramod, thanks for your reply. I was reading about this cost function you mentioned, but the description says it is best suited where the labels are not mutually exclusive - but in my model they are. I'm now adjusting my network with the help of TensorBoard, and I will surely try to make my net more complex. – Darkobra Oct 04 '16 at 20:17
  • As per the question: "I have a very large dataset (about 1.5×10^6 examples), each with a binary (0/1) label". It is binary classification, where each instance is either true (1) or false (0). What do you mean by mutually exclusive? I am unable to get it. – Pramod Patil Oct 05 '16 at 04:33
  • I think you are talking about this: "Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive." As per your description, I think your labels are not mutually exclusive and independent. Check out this: http://stats.stackexchange.com/questions/107768/what-is-the-difference-between-a-multi-label-and-a-multi-class-classification – Pramod Patil Oct 05 '16 at 04:43
  • I am sorry that my description misled you into thinking that my labels are not mutually exclusive, but they are. For context, I had a large dataset of tweets, each labeled with a sentiment that can be "positive" (1) or "negative" (0). – Darkobra Oct 05 '16 at 13:38