Backpropagation algorithm through cross-channel local response normalization (LRN) layer

Question

I am working on replicating a neural network. I'm trying to get an understanding of how the standard layer types work. In particular, I'm having trouble finding a description anywhere of how cross-channel normalisation layers behave on the backward-pass.

Since the normalization layer has no parameters, I could guess two possible options:

The error gradients from the next (i.e. later) layer are passed backwards without doing anything to them.
The error gradients are normalized in the same way the activations are normalized across channels in the forward pass.

I can't think of a reason why you'd do one over the other based on any intuition, hence why I'd like some help on this.

EDIT1:

The layer is a standard layer in Caffe, as described here http://caffe.berkeleyvision.org/tutorial/layers.html (see 'Local Response Normalization (LRN)').

The layer's implementation in the forward pass is described in section 3.3 of the AlexNet paper: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

EDIT2:

I believe the forward and backward pass algorithms are described in both the Torch library here: https://github.com/soumith/cudnn.torch/blob/master/SpatialCrossMapLRN.lua

and in the Caffe library here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/lrn_layer.cpp

Could anyone who is familiar with either or both of these translate the method for the backward pass into plain English?

Can you link to a reference about "cross-channel normalisation layers"? Google only reveals an arxiv paper that appears to talk about a lot of other things as well. It hardly seems like a standard layer type. — IVlad, Nov 18 '15 at 17:11

score 5 · Accepted Answer · answered Nov 28 '15 at 03:30

It uses the chain rule to propagate the gradient backwards through the local response normalization layer. It is somewhat similar to a nonlinearity layer in this sense (which also doesn't have trainable parameters on its own, but does affect gradients going backwards).

From the Caffe code that you linked to, I see that they take the error at each neuron as an input and compute the error for the previous layer as follows.

First, on the forward pass they cache a so-called scale, which is computed (in the notation of the AlexNet paper, see the formula in section 3.3) as:

scale_i = k + alpha / n * sum(a_j ^ 2)

Here and below, sum is the sum over j from max(0, i - n/2) to min(N - 1, i + n/2), where N is the number of channels and n is the size of the local window.

(Note that the paper does not divide by n, so I assume this is something Caffe does differently from AlexNet.) The forward pass is then computed as b_i = a_i * scale_i ^ -beta.
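To make the forward pass concrete, here is a minimal pure-Python sketch for a single spatial position. It takes the forward pass as the product b_i = a_i * scale_i ^ -beta, as in the AlexNet paper, and divides alpha by n as described above. The function name `lrn_forward` is made up for this illustration, and the default constants are the ones quoted in the AlexNet paper; neither is taken from the Caffe source.

```python
def lrn_forward(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel LRN for one spatial position.

    a is a list with one activation per channel.
    Returns (b, scale); scale is cached because the backward pass reuses it.
    """
    N = len(a)
    half = n // 2
    scale = []
    for i in range(N):
        lo = max(0, i - half)        # clip the window at the first channel
        hi = min(N - 1, i + half)    # clip the window at the last channel
        window_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        scale.append(k + alpha / n * window_sum)
    b = [a[i] * scale[i] ** -beta for i in range(N)]
    return b, scale


# Example: three channels, window of size 3.
b, scale = lrn_forward([1.0, 2.0, 3.0], n=3, k=1.0, alpha=3.0, beta=1.0)
print(scale)
print(b)
```

This version recomputes each window from scratch, which is fine for checking results but slower than a running sum.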

To backward propagate the error, let's say that the error coming from the next layer is be_i, and the error that we need to compute is ae_i. Then ae_i is computed as:

ae_i = scale_i ^ -beta * be_i - (2 * alpha * beta / n) * a_i * sum(be_j * b_j / scale_j)
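If you want to convince yourself that this formula is right, one option is to compare it against a numerical gradient. The sketch below reads the exponent in the formula as -beta and treats the incoming errors be_i as the derivatives of some loss with respect to b_i. All function names are hypothetical, and the small forward pass is repeated only so the example runs on its own.

```python
def lrn_forward(a, n, k, alpha, beta):
    """Forward pass b_i = a_i * scale_i ^ -beta, with alpha divided by n."""
    N, half = len(a), n // 2
    scale = [k + alpha / n * sum(x * x for x in a[max(0, i - half):i + half + 1])
             for i in range(N)]
    b = [a[i] * scale[i] ** -beta for i in range(N)]
    return b, scale


def lrn_backward(a, b, scale, be, n, alpha, beta):
    """ae_i = scale_i^-beta * be_i - (2*alpha*beta/n) * a_i * sum(be_j*b_j/scale_j)."""
    N, half = len(a), n // 2
    addend = [be[j] * b[j] / scale[j] for j in range(N)]
    ae = []
    for i in range(N):
        # Same clipped window as in the forward pass.
        window_sum = sum(addend[max(0, i - half):i + half + 1])
        ae.append(scale[i] ** -beta * be[i]
                  - 2.0 * alpha * beta / n * a[i] * window_sum)
    return ae


def numerical_grad(a, be, n, k, alpha, beta, eps=1e-6):
    """Central differences of L = sum_i be_i * b_i with respect to each a_i."""
    grads = []
    for i in range(len(a)):
        plus, minus = list(a), list(a)
        plus[i] += eps
        minus[i] -= eps
        b_plus, _ = lrn_forward(plus, n, k, alpha, beta)
        b_minus, _ = lrn_forward(minus, n, k, alpha, beta)
        grads.append(sum(w * (p - m) for w, p, m in zip(be, b_plus, b_minus))
                     / (2 * eps))
    return grads


a = [0.5, -1.0, 2.0, 0.3, -0.7, 1.2]
be = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
b, scale = lrn_forward(a, n=3, k=1.0, alpha=0.5, beta=0.75)
print(lrn_backward(a, b, scale, be, n=3, alpha=0.5, beta=0.75))
print(numerical_grad(a, be, n=3, k=1.0, alpha=0.5, beta=0.75))
```

The two printed lists should agree to several decimal places. The first term is the direct effect of a_i on b_i; the second collects the effect of a_i on every scale_j whose window contains channel i.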

Since you are planning to implement it manually, I will also share two tricks that Caffe uses in its code to make the implementation simpler:

When you compute the addends for the sum, allocate an array of size N + n - 1 and pad it with n/2 zeros on each end. This way you can compute the sum from i - n/2 to i + n/2 without worrying about going below zero or beyond N - 1.
You don't need to recompute the sum on each iteration. Instead, compute the addends in advance (a_j^2 for the forward pass, be_j * b_j / scale_j for the backward pass), then compute the sum for i = 0. For each consecutive i, just add addend[i + n/2] and subtract addend[i - n/2 - 1]; this gives you the sum for the new value of i in constant time.
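The two tricks can be combined into one small helper. This is only a sketch of the idea and not a translation of Caffe's code: the name `windowed_sums` is made up, and it assumes n is odd so that n/2 (integer division) zeros on each side give a padded array of size N + n - 1.

```python
def windowed_sums(addends, n):
    """Return, for every i, the sum of addends[j] for j in [i - n//2, i + n//2].

    Out-of-range positions count as zero. Runs in O(N) total.
    """
    N = len(addends)
    half = n // 2
    # Zero padding on both ends removes the need for bounds checks.
    padded = [0.0] * half + list(addends) + [0.0] * half
    # In padded coordinates, the window for channel i is padded[i : i + 2*half + 1].
    current = sum(padded[:2 * half + 1])
    sums = [current]
    for i in range(1, N):
        # Slide the window by one: add the entering element, drop the leaving one.
        current += padded[i + 2 * half] - padded[i - 1]
        sums.append(current)
    return sums


# Forward pass addends are a_j ^ 2; backward pass addends are be_j * b_j / scale_j.
a = [1.0, 2.0, 3.0, 4.0]
print(windowed_sums([x * x for x in a], 3))
```

Because of the shift introduced by the padding, the indices added and subtracted here are written in padded coordinates; they correspond to addend[i + n/2] and addend[i - n/2 - 1] in the unpadded array.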

Shouldn't this be `b_i = a_i * scale_i ^ -beta`? – Christoph Körner, Jul 25 '16 at 18:13

score -1 · Answer 2 · answered Nov 24 '15 at 02:22

Of course, you can either print the variables to observe how they change, or use debug mode to see how the errors change as they pass through the net.

answered Nov 24 '15 at 02:22 by fei

score -1 · Answer 3 · edited Dec 16 '16 at 22:37

I have an alternative formulation of the backward pass, and I don't know if it is equivalent to Caffe's.

Caffe's is:

ae_i = scale_i ^ -beta * be_i - (2 * alpha * beta / n) * a_i * sum(be_j * b_j / scale_j)

By differentiating the original expression

b_i = a_i / scale_i ^ beta

I get

ae_i = scale_i ^ -beta * be_i - (2 * alpha * beta / n) * a_i * be_i * sum(ae_j) / scale_i ^ (-beta - 1)

edited Dec 16 '16 at 22:37 by Paul Roub

answered Dec 16 '16 at 22:31 by Anand Venkat
