In TensorFlow or Theano, you only tell the library how your neural network is structured and how the feed-forward pass should operate.

For instance, in TensorFlow, you would write:

import tensorflow as tf

# X, y: NumPy arrays holding the training features and targets
graph = tf.Graph()

with graph.as_default():
    _X = tf.constant(X)
    _y = tf.constant(y)

    # hidden layer with 20 units
    hidden = 20
    w0 = tf.Variable(tf.truncated_normal([X.shape[1], hidden]))
    b0 = tf.Variable(tf.truncated_normal([hidden]))

    h = tf.nn.softmax(tf.matmul(_X, w0) + b0)

    # output layer with a single unit
    w1 = tf.Variable(tf.truncated_normal([hidden, 1]))
    b1 = tf.Variable(tf.truncated_normal([1]))

    yp = tf.nn.softmax(tf.matmul(h, w1) + b1)

    # L2 loss and one step of gradient descent with learning rate 0.5
    loss = tf.reduce_mean(0.5*tf.square(yp - _y))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

I am using the L2-norm loss function, C = 0.5*sum((y-yp)^2), and in the backpropagation step the derivative with respect to the output, dC/dyp = yp - y, presumably has to be computed. See (30) in this book.
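
To be concrete about the analytical derivative I have in mind, here is a minimal NumPy sketch (the helper name l2_loss_and_grad is just for illustration, and y, yp are assumed to be plain NumPy arrays like the ones fed into the graph above):

import numpy as np

def l2_loss_and_grad(y, yp):
    # C = 0.5 * sum((y - yp)^2)
    diff = yp - y
    return 0.5 * np.sum(diff ** 2), diff   # dC/dyp = yp - y, elementwise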

My question is: how can TensorFlow (or Theano) know the analytical derivative for backpropagation? Do they use an approximation? Or do they somehow avoid using the derivative altogether?

I have done the Udacity deep learning course on TensorFlow, but I am still struggling to make sense of how these libraries work.

Ricardo Magalhães Cruz

1 Answer

The differentiation happens in the final line:

    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

When you execute the minimize() method, TensorFlow identifies the set of variables on which loss depends, and computes gradients for each of these. The differentiation is implemented in ops/gradients.py, and it uses "reverse accumulation". Essentially it searches backwards from the loss tensor to the variables, applying the chain rule at each operator in the dataflow graph. TensorFlow includes "gradient functions" for most (differentiable) operators, and you can see an example of how these are implemented in ops/math_grad.py. A gradient function can use the original op (including its inputs, outputs, and attributes) and the gradients computed for each of its outputs to produce gradients for each of its inputs.
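
As a rough sketch (TF 1.x API, reusing loss, w0, b0, w1 and b1 from the question's graph), minimize() is essentially compute_gradients() followed by apply_gradients(), and you can invoke the same differentiation machinery directly with tf.gradients():

    opt = tf.train.GradientDescentOptimizer(0.5)
    grads_and_vars = opt.compute_gradients(loss)    # reverse accumulation through the graph
    train_op = opt.apply_gradients(grads_and_vars)  # applies w <- w - 0.5 * gradient

    # Or ask for the symbolic gradient tensors yourself:
    grads = tf.gradients(loss, [w0, b0, w1, b1])

The per-op gradient functions look roughly like this simplified excerpt of the one registered for "Square" in ops/math_grad.py (TensorFlow registers it itself; it is shown here only for illustration):

    @ops.RegisterGradient("Square")
    def _SquareGrad(op, grad):
        x = op.inputs[0]
        return grad * (2.0 * x)   # incoming gradient times d(x^2)/dx = 2x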

Page 7 of Ilya Sutskever's PhD thesis has a nice explanation of how this process works in general.

mrry
  • I wonder whether the derivative can be calculated automatically regardless of whether I define my loss function via Python or via TF operations? – lhao0301 Oct 26 '16 at 06:06
  • It knows all the individual derivatives (where they exist) of the TF operations, not of the Python operations. – Jan van der Vegt Jan 19 '17 at 10:51
  • So the loss function must be a loss function provided by the TensorFlow library? Because TensorFlow already stores the expression of the loss function's first-order derivative? Does PyTorch work similarly? I noticed in the example of reverse accumulation within the Wikipedia link, the first derivative of `sin()` is already known to be `cos()`. – E. Kaufman Jul 01 '21 at 20:51