5

Suppose I have an artificial neural network with 5 hidden layers. For the moment, forget about the details of the neural network model such as biases, the activation functions used, the type of data, and so on. Of course, the activation functions are differentiable.

With symbolic differentiation, the following computes the gradients of the objective function with respect to the layers' weights:

w1_grad = T.grad(lost, [w1])
w2_grad = T.grad(lost, [w2])
w3_grad = T.grad(lost, [w3])
w4_grad = T.grad(lost, [w4])
w5_grad = T.grad(lost, [w5])
w_output_grad = T.grad(lost, [w_output])

This way, to compute the gradients w.r.t. w1, the gradients w.r.t. w2, w3, w4 and w5 must first be computed. Similarly, to compute the gradients w.r.t. w2, the gradients w.r.t. w3, w4 and w5 must be computed first.
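To make that sharing explicit, here is a minimal NumPy sketch of the backpropagation recursion for a hypothetical 3-layer sigmoid network with a squared-error loss (not the network above; all names here are made up for illustration). Each layer's error term delta is built from the delta of the layer after it, so the gradient w.r.t. the first weight matrix reuses everything already computed for the later ones:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.rand(1, 4)                            # one example with 4 features
t = rng.rand(1, 2)                            # dummy target
W1, W2, W3 = rng.rand(4, 8), rng.rand(8, 6), rng.rand(6, 2)

# forward pass
a1 = sigmoid(x.dot(W1))
a2 = sigmoid(a1.dot(W2))
a3 = a2.dot(W3)                               # linear output, loss = 0.5 * ||a3 - t||^2

# backward pass: each delta is built from the one after it
delta3 = a3 - t                               # dL/dz3
delta2 = delta3.dot(W3.T) * a2 * (1 - a2)     # reuses delta3
delta1 = delta2.dot(W2.T) * a1 * (1 - a1)     # reuses delta2 (and hence delta3)

W3_grad = a2.T.dot(delta3)
W2_grad = a1.T.dot(delta2)
W1_grad = x.T.dot(delta1)                     # depends on all the deltas above

So the question is essentially whether Theano keeps those shared intermediate terms around or rebuilds them for every weight matrix.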

However, the following code also computes the gradients w.r.t. each weight matrix:

w1_grad, w2_grad, w3_grad, w4_grad, w5_grad, w_output_grad = T.grad(lost, [w1, w2, w3, w4, w5, w_output])

I was wondering: is there any difference between these two methods in terms of performance? Is Theano intelligent enough to avoid re-computing the gradients when using the second method? By intelligent I mean that, to compute w3_grad, Theano should [preferably] use the pre-computed gradients of w_output_grad, w5_grad and w4_grad instead of computing them again.

Amir
  • Interesting question. Have you simply tried it out and measured the runtime for both methods? – hbaderts Dec 23 '15 at 21:31
  • @hbaderts Not yet, I will do some experiments and post the performance results here. – Amir Dec 24 '15 at 15:29
  • @hbaderts Solution is posted now. – Amir Dec 27 '15 at 21:52
  • thanks for the insight, nice question and answer. – hbaderts Dec 28 '15 at 09:39
  • Amir, you've been editing new tags (jacobean, hessian) into a bunch of old questions, but I'm not convinced the tags are necessary - I think they are ["meta tags"](http://stackoverflow.com/help/tagging), as they can't really stand on their own. They certainly aren't important enough to justify tag-only edits on old posts that have lots of other problems. I suggest you make a meta post to open a discussion about whether the tags are welcome; if there is community support for adding them, it won't need to be a one-man effort. – Mogsdad Jan 18 '16 at 17:08
  • @Mogsdad For Hessian, it was certainly necessary as there is a software named "hessian" which has something to do with network/internet. So, most questions were mixed up and I separated them out by creating the tag "hessian-matrix". For Jacobian-matrix, there certainly is the need to have a new tag for it. Jacobian is a very important mathematical term and questions that have something to do with mathematical-optimization and ask something related to the Jacobian matrix are just all over the place. – Amir Jan 18 '16 at 17:11
  • Mathematical terms might be important enough to warrant their own tags on a math site, but this is a programming site. I recommend a meta.SO question as well, especially considering the tags have no usage guidance or descriptions. – TylerH Jan 18 '16 at 17:18
  • @TylerH Well you might be right, but only to some extent. You might not be involved in mathematics, and that's why you complain about what I've done. I do a lot of programming and my programs involve dealing with such math, like many others who have posted a lot of questions related to mathematical optimization. These questions must be categorized properly so that people can find them more easily and they will pop up on search engines when someone looks for those terms. – Amir Jan 18 '16 at 17:21
  • @Amir And we aren't saying categorically that they *shouldn't* ever exist here, merely that you start a discussion on meta, because we are hesitant to agree with you about the meta status of such tags, as it were. If you can provide a compelling argument there, along with a good tag description that offers accurate usage guidance (where currently there is none), then as Mogsdad said, others can even assist you with adding these tags to appropriate questions. – TylerH Jan 18 '16 at 17:25

1 Answer

4

Well, it turns out Theano does not reuse previously computed gradients when computing the gradients for the lower layers of a computational graph. However, this is not a big deal in practice, because building the symbolic gradients is a one-time operation: T.grad returns the derivatives as a symbolic computational graph, which you compile into a function once and then reuse on every iteration to compute numerical gradient values and update the weights with them.
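For example, a minimal sketch of that pattern (a made-up one-layer softmax model, not the question's network; the shapes and the names w, b, lr and train are placeholders) might look like this:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.ivector('y')
w = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='w')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p_y = T.nnet.softmax(T.dot(x, w) + b)
loss = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

# symbolic differentiation happens here, exactly once
gw, gb = T.grad(loss, [w, b])

lr = 0.1
train = theano.function(                 # compiled once, called many times
    inputs=[x, y],
    outputs=loss,
    updates=[(w, w - lr * gw), (b, b - lr * gb)])

# every later call only evaluates the already-built gradient graph:
# train(x_batch, y_batch)

Here's a dummy example of a neural network with 3 hidden layers and an output layer that times both ways of calling T.grad: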

import time

import numpy as np
import theano
import theano.tensor as T
from theano import shared

class neuralNet(object):
    def __init__(self, examples):
        # weights and biases for 3 hidden layers and an output layer
        self.w = shared(np.random.random((16384, 5000)).astype(theano.config.floatX), borrow=True, name='w')
        self.w2 = shared(np.random.random((5000, 3000)).astype(theano.config.floatX), borrow=True, name='w2')
        self.w3 = shared(np.random.random((3000, 512)).astype(theano.config.floatX), borrow=True, name='w3')
        self.w4 = shared(np.random.random((512, 40)).astype(theano.config.floatX), borrow=True, name='w4')
        self.b = shared(np.ones(5000, dtype=theano.config.floatX), borrow=True, name='b')
        self.b2 = shared(np.ones(3000, dtype=theano.config.floatX), borrow=True, name='b2')
        self.b3 = shared(np.ones(512, dtype=theano.config.floatX), borrow=True, name='b3')
        self.b4 = shared(np.ones(40, dtype=theano.config.floatX), borrow=True, name='b4')
        self.x = examples

        # forward propagation through the sigmoid hidden layers and the softmax output
        L1 = T.nnet.sigmoid(T.dot(self.x, self.w) + self.b)
        L2 = T.nnet.sigmoid(T.dot(L1, self.w2) + self.b2)
        L3 = T.nnet.sigmoid(T.dot(L2, self.w3) + self.b3)
        L4 = T.dot(L3, self.w4) + self.b4
        self.forwardProp = T.nnet.softmax(L4)
        self.predict = T.argmax(self.forwardProp, axis=1)

    def loss(self, y):
        # negative log-likelihood of the correct classes
        return -T.mean(T.log(self.forwardProp)[T.arange(y.shape[0]), y])

x = T.matrix('x')
y = T.ivector('y')

nnet = neuralNet(x)
loss = nnet.loss(y)

# all gradients built in a single call to T.grad
differentiationTime = []
for i in range(100):
    t1 = time.time()
    gw, gw2, gw3, gw4, gb, gb2, gb3, gb4 = T.grad(loss, [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4])
    differentiationTime.append(time.time() - t1)
print('Efficient Method: Took %f seconds with std %f' % (np.mean(differentiationTime), np.std(differentiationTime)))

# each gradient built with a separate call to T.grad
differentiationTime = []
for i in range(100):
    t1 = time.time()
    gw = T.grad(loss, [nnet.w])
    gw2 = T.grad(loss, [nnet.w2])
    gw3 = T.grad(loss, [nnet.w3])
    gw4 = T.grad(loss, [nnet.w4])
    gb = T.grad(loss, [nnet.b])
    gb2 = T.grad(loss, [nnet.b2])
    gb3 = T.grad(loss, [nnet.b3])
    gb4 = T.grad(loss, [nnet.b4])
    differentiationTime.append(time.time() - t1)
print('Inefficient Method: Took %f seconds with std %f' % (np.mean(differentiationTime), np.std(differentiationTime)))

This will print out the following:

Efficient Method: Took 0.061056 seconds with std 0.013217
Inefficient Method: Took 0.305081 seconds with std 0.026024

This shows that Theano uses a dynamic-programming approach when all the gradients are requested in a single T.grad call: intermediate results are shared instead of being rebuilt for every parameter.
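If you want to check this beyond the timings, one option (a sketch that reuses nnet, x, y, loss and the imports from the code above; the exact node counts will depend on your Theano version and optimization flags) is to compile both variants and compare the sizes of the optimized graphs:

import theano

params = [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4]

# one function computing all gradients vs. one function per parameter
f_joint = theano.function([x, y], T.grad(loss, params))
f_separate = [theano.function([x, y], T.grad(loss, [p])) for p in params]

print('joint graph size:    %d' % len(f_joint.maker.fgraph.toposort()))
print('separate graph size: %d' % sum(len(f.maker.fgraph.toposort()) for f in f_separate))

If the single-call graph is much smaller than the sum of the per-parameter graphs, the shared sub-expressions are indeed being reused.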

Amir