
I want to take a closer look at the Jacobians of each layer in a fully connected neural network, i.e. ∂y/∂x, where x is the input vector to the layer (the activations of the previous layer) and y is its output vector (the activations of this layer).

In an online learning scheme, this could be easily done as follows:

import theano
import theano.tensor as T
import numpy as np

x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))

# computation of Jacobian
j = T.jacobian(y, x)
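
For concreteness, this is a minimal usage sketch of the above (the function name jac_fn and the random input are mine, not part of the original code):

# compile and evaluate the Jacobian for a single sample
jac_fn = theano.function([x], j)
sample = np.random.randn(5).astype(x.dtype)
print(jac_fn(sample).shape)  # (10, 5): one row per output unit, one column per input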

When learning on batches, you need an additional scan to get the Jacobian for each sample:

x = T.matrix('x')
...

# computation of the Jacobian for each sample
# (note: theano.scan returns a (result, updates) pair)
j, updates = theano.scan(
    lambda i, a, b: T.jacobian(b[i], a)[:, i],
    sequences=T.arange(y.shape[0]), non_sequences=[x, y]
)
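
A minimal sketch of how this batched version could be compiled and run, assuming the elided part defines the layer row-wise as y = T.tanh(T.dot(x, w.T)), so that y has shape (batch, 10) and y[i] depends only on x[i]:

# sketch only: assumes y = T.tanh(T.dot(x, w.T)) in the elided code above
jac_batch_fn = theano.function([x], j)
batch = np.random.randn(32, 5).astype(x.dtype)
print(jac_batch_fn(batch).shape)  # (32, 10, 5): one (10, 5) Jacobian per sample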

This works perfectly well for toy examples, but when training a network with multiple layers of 1000 hidden units on thousands of samples, this approach slows the computations down massively. (The idea behind indexing the result of the Jacobian can be found in this question.)

The thing is, I believe there is no need for this explicit Jacobian computation when we are already computing the derivative of the loss. After all, the gradient of the loss with respect to, e.g., the inputs of the network can be decomposed as

∂L(y, y_L)/∂x = ∂L(y, y_L)/∂y_L · ∂y_L/∂y_(L-1) · ∂y_(L-1)/∂y_(L-2) · ... · ∂y_2/∂y_1 · ∂y_1/∂x

i.e. the gradient of the loss w.r.t. x is the product of the derivatives of each layer (L being the number of layers).

My question is thus whether (and how) it is possible to avoid this extra computation and reuse the decomposition discussed above. I assume it should be possible, because automatic differentiation is essentially an application of the chain rule (as far as I understand it). However, I haven't been able to find anything that backs this idea up. Any suggestions, hints or pointers?

1 Answer


T.jacobian is very inefficient because it uses scan internally. If you plan to multiply the Jacobian matrix by something, you should use T.Lop or T.Rop for left / right multiplication respectively. Currently a "smart" Jacobian does not exist in Theano's gradient module; you have to hand-craft it if you want an optimized Jacobian.
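
For illustration, a minimal sketch (using the single-sample layer from the question) of how such products can be formed without ever materialising the full Jacobian; the placeholder vectors v_in and v_out are my own, not part of the original code:

import theano
import theano.tensor as T
import numpy as np

x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))

v_in = T.vector('v_in')    # placeholder, same shape as x (length 5)
v_out = T.vector('v_out')  # placeholder, same shape as y (length 10)

Jv = T.Rop(y, x, v_in)     # right product:  (dy/dx) . v_in   -> shape (10,)
vJ = T.Lop(y, x, v_out)    # left product:   v_out . (dy/dx)  -> shape (5,)

f_Jv = theano.function([x, v_in], Jv)
f_vJ = theano.function([x, v_out], vJ)

This is essentially what T.grad does internally: backpropagation chains such vector-Jacobian (L-operator) products layer by layer, which is the decomposition written out in the question.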

Instead of using theano.scan, use a batched op such as T.batched_dot when possible. theano.scan will always result in a CPU loop.
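
As a sketch (assuming the per-sample Jacobians have already been stacked into a 3-D tensor; jacs and vecs are illustrative names, not from the question), multiplying every sample's Jacobian by a per-sample vector then becomes a single batched op:

import theano
import theano.tensor as T
import numpy as np

jacs = T.tensor3('jacs')   # assumed shape (batch, n_out, n_in)
vecs = T.tensor3('vecs')   # assumed shape (batch, n_in, 1)

prods = T.batched_dot(jacs, vecs)   # shape (batch, n_out, 1), no Python-level loop
f = theano.function([jacs, vecs], prods)

out = f(np.random.randn(32, 10, 5).astype(theano.config.floatX),
        np.random.randn(32, 5, 1).astype(theano.config.floatX))
print(out.shape)  # (32, 10, 1)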

Kh40tiK