
As a toy example I'm trying to fit the function f(x) = 1/x from 101 noise-free data points. The MATLAB default implementation is phenomenally successful, with a mean square difference of ~10^-10, and it interpolates perfectly.

I implement a neural network with one hidden layer of 10 sigmoid neurons in TensorFlow. I'm a beginner at neural networks, so be on your guard against dumb code.

import tensorflow as tf
import numpy as np

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

#Can't make tensorflow consume ordinary lists unless they're parsed to ndarray
def toNd(lst):
    lgt = len(lst)
    x = np.zeros((1, lgt), dtype='float32')
    for i in range(0, lgt):
        x[0,i] = lst[i]
    return x

xBasic = np.linspace(0.2, 0.8, 101)
xTrain = toNd(xBasic)
yTrain = toNd(map(lambda x: 1/x, xBasic))

x = tf.placeholder("float", [1,None])
hiddenDim = 10

b = bias_variable([hiddenDim,1])
W = weight_variable([hiddenDim, 1])

b2 = bias_variable([1])
W2 = weight_variable([1, hiddenDim])

hidden = tf.nn.sigmoid(tf.matmul(W, x) + b)
y = tf.matmul(W2, hidden) + b2
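# Shape check: x is [1, N], so hidden = sigmoid(W·x + b) is [10, N] and y = W2·hidden + b2 is [1, N].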

# Minimize the squared errors.
loss = tf.reduce_mean(tf.square(y - yTrain))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# For initializing the variables.
init = tf.initialize_all_variables()

# Launch the graph
sess = tf.Session()
sess.run(init)

for step in xrange(0, 4001):
    train.run({x: xTrain}, sess)
    if step % 500 == 0:
        print loss.eval({x: xTrain}, sess)

The mean square difference ends up at ~2*10^-3, about 7 orders of magnitude worse than MATLAB. Visualising with

xTest = np.linspace(0.2, 0.8, 1001)
yTest = y.eval({x:toNd(xTest)}, sess)  
import matplotlib.pyplot as plt
plt.plot(xTest,yTest.transpose().tolist())
plt.plot(xTest,map(lambda x: 1/x, xTest))
plt.show()

we can see the fit is systematically imperfect:

[plot: TensorFlow fit vs. 1/x, showing a systematic deviation]

while the MATLAB one looks perfect to the naked eye, with the differences uniformly < 10^-5:

[plot: MATLAB fit, errors uniformly below 10^-5]

I have tried to replicate in TensorFlow the architecture shown in the diagram of the MATLAB network:

[diagram of the MATLAB network]

Incidentally, the diagram seems to imply a tanh rather than a sigmoid activation function. I cannot find it anywhere in the documentation to be sure. However, when I try to use tanh neurons in TensorFlow, the fitting quickly fails, with nan for the variables. I do not know why.
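For concreteness, the tanh variant is the same graph with only the activation swapped (a minimal sketch of the one-line change):

# Same hidden layer as above, but with tanh instead of sigmoid; with the
# settings shown this is the variant that diverges to nan for me.
hidden = tf.nn.tanh(tf.matmul(W, x) + b)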

MATLAB uses the Levenberg–Marquardt training algorithm. Bayesian regularization is even more successful, with mean squares at 10^-12 (we are probably down at the limits of float arithmetic).

Why is the TensorFlow implementation so much worse, and what can I do to make it better?

Arbil
  • I haven't looked into TensorFlow yet, so sorry about that, but you're doing some bizarre things with numpy there with that `toNd` function. `np.linspace` already returns an ndarray, not a list; if you want to convert a list to an ndarray, all you need is `np.array(my_list)`, and if you just need the extra axis, you can do `new_array = my_array[np.newaxis, :]` (see the sketch after these comments). It might just be stopping short of zero error because it's supposed to do that. Most data has noise and you don't necessarily want zero training error on it. Judging by 'reduce_mean', it may be using cross-validation. – Adam Acosta Nov 15 '15 at 14:34
  • @AdamAcosta `toNd` is definitely a stop-gap for my lack of experience. I tried `np.array` before and the problem seems to be that `np.array([5,7]).shape` is `(2,)` and not `(2,1)`. `my_array[np.newaxis, :]` seems to correct this, thanks! I do not use python but rather F# day-to-day. – Arbil Nov 15 '15 at 14:48
  • @AdamAcosta I don't think `reduce_mean` does cross-validation. From the docs: `Computes the mean of elements across dimensions of a tensor`. MATLAB does cross-validation, which to my mind should reduce the fit on the training sample compared to no cross-validation; is that right? – Arbil Nov 15 '15 at 14:55
  • Yeah, cross-validation should normally prevent a perfect fit. Sorry for the lack of a real answer. Knowledge of TensorFlow is still pretty sparse. I've seen a lot of questions come up about it lately and not too many answers. Udacity is developing a course on it as part of their new machine learning engineer nanodegree. I swear I don't work for Udacity but it might be worth looking into! – Adam Acosta Nov 15 '15 at 16:02
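A minimal sketch of the numpy-only conversion suggested in the comments above (no hand-rolled `toNd` needed):

import numpy as np

# np.linspace already returns a 1-D ndarray; adding an axis gives the
# [1, N] row shape the placeholder expects.
xBasic = np.linspace(0.2, 0.8, 101)   # shape (101,)
xTrain = xBasic[np.newaxis, :]        # shape (1, 101)
yTrain = 1 / xTrain                   # shape (1, 101)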

2 Answers


I tried training for 50000 iterations and it got to 0.00012 error. It takes about 180 seconds on a Tesla K40.

[plot of the fit after 50000 iterations]

It seems that for this kind of problem, first-order gradient descent is not a good fit (pun intended), and you need Levenberg–Marquardt or L-BFGS. I don't think anyone has implemented them in TensorFlow yet.

Edit: Use `tf.train.AdamOptimizer(0.1)` for this problem. It gets to 3.13729e-05 after 4000 iterations. Also, a GPU with the default strategy seems like a bad idea for this problem: there are many small operations, and the overhead causes the GPU version to run 3x slower than the CPU on my machine.
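As a comment below notes, later TensorFlow releases expose scipy's optimizers through tf.contrib.opt.ScipyOptimizerInterface. A rough sketch of plugging L-BFGS into the graph from the question, assuming a TF 1.x build where contrib is available and reusing its sess, loss, x and xTrain:

# Rough sketch (TF 1.x): hand the loss to scipy's L-BFGS-B instead of
# stepping a first-order optimizer manually.
optimizer = tf.contrib.opt.ScipyOptimizerInterface(
    loss, method='L-BFGS-B', options={'maxiter': 1000})
optimizer.minimize(sess, feed_dict={x: xTrain})  # runs the full scipy loop internally
print(sess.run(loss, feed_dict={x: xTrain}))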

Yaroslav Bulatov
  • Thanks for checking this out. Do you mean 5000 of my loops, so 20M basic training runs? Can you confirm that it fails when changing the hidden layer to tanh neurons, and if so, do you know why it happens? – Arbil Nov 15 '15 at 18:55
  • I just changed your xrange(4001) to xrange(5000). For tanh, it looks like the training diverges with learning rate 0.5. In general, for gradient descent you need to tune the learning rate for each problem; it seems to work if I do `tf.train.GradientDescentOptimizer(0.1)`. – Yaroslav Bulatov Nov 15 '15 at 19:29
  • I see about the gradient parameter. It's very strange that xrange(0, 5000) gives you an order of magnitude better accuracy than the 4k range, and that it takes 180s on a GPU. I ran the same range on the CPU with accuracy unchanged, and it takes less than 10s. – Arbil Nov 15 '15 at 19:38
  • oops, typo, 50000, not 5000 – Yaroslav Bulatov Nov 15 '15 at 20:17
  • So I just tried something different: `optimizer = tf.train.AdamOptimizer(0.1)` seems to do much better, 3.13729e-05 after 4000 iterations. – Yaroslav Bulatov Nov 15 '15 at 20:22
  • Thanks. I thought I checked this one but probably tried it with a wrong parameter. Next in my pipeline is reading about the optimizing algorithms then. – Arbil Nov 15 '15 at 20:28
  • Also, changing your datatype from float32 to float64 and adjusting AdamOptimizer to use an exponentially decaying learning rate stepping down from 0.2 with exp decay 0.9999 gets 1.44e-05 after 4000 training steps: `step = tf.Variable(0, trainable=False); rate = tf.train.exponential_decay(0.2, step, 1, 0.9999); optimizer = tf.train.AdamOptimizer(rate); train = optimizer.minimize(loss, global_step=step)` – dga Nov 15 '15 at 21:01
  • Now you can use scipy from TensorFlow: https://www.tensorflow.org/api_docs/python/tf/contrib/opt/ScipyOptimizerInterface – quant_dev Nov 27 '17 at 05:28

By the way, here's a slightly cleaned-up version of the above that fixes some of the shape issues and the unnecessary bouncing between tf and np. It achieves 3e-08 after 40k steps, or about 1.5e-5 after 4000:

import tensorflow as tf
import numpy as np

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

xTrain = np.linspace(0.2, 0.8, 101).reshape([1, -1])
yTrain = (1/xTrain)

x = tf.placeholder(tf.float32, [1,None])
hiddenDim = 10

b = bias_variable([hiddenDim,1])
W = weight_variable([hiddenDim, 1])

b2 = bias_variable([1])
W2 = weight_variable([1, hiddenDim])

hidden = tf.nn.sigmoid(tf.matmul(W, x) + b)
y = tf.matmul(W2, hidden) + b2

# Minimize the squared errors.                                                                
loss = tf.reduce_mean(tf.square(y - yTrain))
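# Adam with an exponentially decaying learning rate: start at 0.15 and multiply by 0.9999 each training step.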
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.15, step, 1, 0.9999)
optimizer = tf.train.AdamOptimizer(rate)
train = optimizer.minimize(loss, global_step=step)
init = tf.initialize_all_variables()

# Launch the graph                                                                            
sess = tf.Session()
sess.run(init)

for step in xrange(0, 40001):
    train.run({x: xTrain}, sess)
    if step % 500 == 0:
        print loss.eval({x: xTrain}, sess)

All that said, it's probably not too surprising that LMA is doing better than a more general DNN-style optimizer for fitting a 2D curve. Adam and the rest are targeting very high dimensionality problems, and LMA starts to get glacially slow for very large networks (see 12-15).

dga