Back-propagation does not use the error values directly. What you back-propagate is the partial derivative of the error with respect to each element of the neural network. Eventually that gives you dE/dW for each weight, and you take a small step in the opposite direction of that gradient (because you want to reduce the error).
To do this, you need to know:
The activation value of each neuron (kept from when doing the feed-forward calculation)
The mathematical form of the error function (e.g. it may be a sum of squares difference). Your first set of derivatives will be dE/da for the output layer (where E is your error and a is the output of the neuron).
The mathematical form of the neuron activation or transfer function. This is where you see why the sigmoid is so widely used: its derivative can conveniently be expressed in terms of the activation value, dy/dx = y * (1 - y). This is fast to compute, and also means you don't have to store or re-calculate the weighted sum.
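As a quick numerical check of that identity (a standalone sketch with illustrative values, separate from the worked example further down), you can compare the y * (1 - y) shortcut against a finite-difference estimate:

x = 0.7;                                    % any input value
y = 1.0 / (1.0 + exp(-x));                  % sigmoid activation
h = 1e-6;
y_h = 1.0 / (1.0 + exp(-(x + h)));
numeric_deriv  = (y_h - y) / h;             % finite-difference dy/dx
analytic_deriv = y * (1 - y);               % shortcut using only the activation
fprintf('numeric %f, analytic %f\n', numeric_deriv, analytic_deriv);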
Please note, I am going to use different notation from you, because your labels make it hard to express the general form of back-propagation.
In my notation:
Superscripts in brackets (k) or (k+1) identify a layer in the network.
There are N neurons in layer (k), indexed with subscript i
There are M neurons in layer (k+1), indexed with subscript j
The sum of inputs to a neuron is z
The output of a neuron is a
A weight W_ij connects a_i in layer (k) to z_j in layer (k+1). Note that W_0j is the weight for the bias term, and sometimes you need to include it, although your diagram does not show bias inputs or weights.
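With this convention, each layer's weights fit in a single (N+1) x M matrix whose first row holds the bias weights W_0j. That is the layout the script further down uses, and why it indexes W_HiddenToOutput(2:end,:) to skip the bias row. A small illustrative sketch:

% N = 2 inputs feeding a layer of M = 2 neurons
W = randn( 3, 2 ) * 0.6;                    % (N+1) x M, row 1 = bias weights
bias_weights  = W(1, :);                    % the W_0j terms
input_weights = W(2:end, :);                % the W_ij terms for i = 1..N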
With the above notation, the general form of the back-propagation algorithm is a five-step process:
1) Calculate the initial dE/da for each neuron in the output layer, where E is your error value and a is the activation of the neuron. This depends entirely on your error function.
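For example, with the half sum-of-squares error E = sum((a - t).^2) / 2 that the script further down assumes (t being the training target vector), this first derivative is simply a - t:

a = [0.7, 0.3];                             % illustrative output activations
t = [0.1, 0.9];                             % illustrative training targets
dEda = a - t;                               % dE/da for E = sum((a - t).^2) / 2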
Then, for each layer (start with k = maximum, your output layer)
2) Backpropagate dE/da to dE/dz for each neuron within a layer (where a is the neuron's output and z is the sum of all inputs to it, including the bias). As well as the value from (1) above, this uses the derivative of your transfer function:
dE/dz_j = dE/da_j * f'(z_j)
(where f is the transfer function; for the sigmoid, f'(z_j) = a_j * (1 - a_j), as noted above)
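Vectorised over a whole layer of sigmoid neurons, step 2 is a single element-wise multiply (the variable names here are illustrative, not from the script below):

dEda = [0.6, -0.6];                         % dE/da for the layer, from step 1 or 3
a    = [0.7, 0.3];                          % the layer's activations
dEdz = dEda .* a .* (1 - a);                % dE/dz via the sigmoid derivative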
(Now reduce k by 1 for consistency with the remainder of the loop):
3) Backpropagate dE/dz from the upper layer to dE/da for each output in the previous layer. This basically involves summing, for each output neuron, across all the weights connecting it to inputs in the upper layer. You don't need to do this for the input layer. Note how it uses the value you calculated in (2):
dE/da_i^(k) = sum over j = 1 to M of ( W_ij * dE/dz_j^(k+1) )
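Vectorised, step 3 is one matrix multiply against the transposed weights, with the bias row dropped because nothing back-propagates to a constant bias input (illustrative names again):

dEdz_upper = [0.1, -0.2];                   % dE/dz for the M = 2 upper-layer neurons
W = randn( 3, 2 );                          % (N+1) x M weights, row 1 = bias
dEda_prev = dEdz_upper * W(2:end, :)';      % dE/da for the N = 2 neurons below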
4) (Independently of (3)) Backpropagate dE/dz from an upper layer to dE/dW for all weights connecting that layer to the previous layer (this includes the bias term):
dE/dW_ij = a_i^(k) * dE/dz_j^(k+1)    (with a_0^(k) = 1 for the bias weight W_0j)
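Vectorised, step 4 is an outer product between the lower layer's outputs (with the bias input 1 prepended) and the upper layer's dE/dz values (illustrative names):

a_with_bias = [1, 0.7, 0.3];                % a_0 = 1 plus the N = 2 activations
dEdz_upper  = [0.1, -0.2];                  % dE/dz for the M = 2 upper-layer neurons
dEdW = a_with_bias' * dEdz_upper;           % (N+1) x M matrix of dE/dW values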
Simply repeat 2 to 4 until you have dE/dW for all your weights. For more advanced networks (e.g. recurrent), you can add in other error sources by re-doing step 1.
5) Now that you have the weight derivatives, you can simply subtract them (times a learning rate) to take a step towards what you hope is the error function's minimum:
W_ij -> W_ij - learn_rate * dE/dW_ij
The maths notation can seem a bit dense in places the first time you see this. But if you look a few times, you will see there are essentially only a few variables, indexed by some combination of i, j and k values. In addition, with Matlab or Octave you can express the vectors and matrices really easily. So, for instance, this is what the whole process might look like for learning a single training example:
clear ; close all; clc; more off
InputVector = [ 0.5, 0.2 ];
TrainingOutputVector = [ 0.1, 0.9 ];
learn_rate = 1.0;
W_InputToHidden = randn( 3, 2 ) * 0.6;   % (2 inputs + bias) x 2 hidden neurons
W_HiddenToOutput = randn( 3, 2 ) * 0.6;  % (2 hidden + bias) x 2 output neurons
for i=1:20,
% Feed-forward input to hidden layer
InputsPlusBias = [1, InputVector];
HiddenActivations = 1.0 ./ (1.0 + exp(-InputsPlusBias * W_InputToHidden));
% Feed-forward hidden layer to output layer
HiddenPlusBias = [ 1, HiddenActivations ];
OutputActivations = 1.0 ./ (1.0 + exp(-HiddenPlusBias * W_HiddenToOutput));
% Backprop step 1: dE/da for output layer (assumes half sum-of-squares error)
OutputActivationDeltas = OutputActivations - TrainingOutputVector;
nn_error = sum( OutputActivationDeltas .* OutputActivationDeltas ) / 2;
fprintf( 'Epoch %d, error %f\n', i, nn_error);
% Backprop step 2: dE/da to dE/dz on the output layer (uses sigmoid derivative)
OutputZDeltas = OutputActivationDeltas .* OutputActivations .* (1 - OutputActivations);
% Backprop steps 3 & 2 combined:
% dE/dz on output layer to dE/dz on hidden layer
% (skips the bias row of the weights, then applies the sigmoid derivative)
HiddenZDeltas = ( OutputZDeltas * W_HiddenToOutput(2:end,:)' ) ...
  .* ( HiddenActivations .* (1 - HiddenActivations) );
% Backprop step 4 (twice): dE/dz to dE/dW for both weight matrices
W_InputToHidden_Deltas = InputsPlusBias' * HiddenZDeltas;
W_HiddenToOutput_Deltas = HiddenPlusBias' * OutputZDeltas;
% Step 5: Alter the weights
W_InputToHidden = W_InputToHidden - learn_rate * W_InputToHidden_Deltas;
W_HiddenToOutput = W_HiddenToOutput - learn_rate * W_HiddenToOutput_Deltas;
end;
As written, this is stochastic gradient descent (the weights are updated once per training example), and obviously it is only learning a single training example.
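As a minimal sketch of how that generalises (assuming a hypothetical Inputs matrix with one training example per row), the same row-vector code handles a whole batch if you prepend a column of ones for the biases; the outer products in step 4 then sum the gradients over all examples automatically:

Inputs = [ 0.5, 0.2; 0.3, 0.8 ];            % hypothetical data, one example per row
W_InputToHidden = randn( 3, 2 ) * 0.6;      % same shape as in the script above
P = size( Inputs, 1 );
InputsPlusBias = [ ones(P, 1), Inputs ];    % P x 3: bias column plus inputs
HiddenActivations = 1.0 ./ (1.0 + exp(-InputsPlusBias * W_InputToHidden));
% Later, InputsPlusBias' * HiddenZDeltas is (N+1) x M and already sums the
% per-example weight gradients across the batch.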
Apologies for the pseudo-math notation in places. Stack Overflow doesn't have simple built-in LaTeX-like maths, unlike Math Overflow. I have skipped some of the derivation/explanation for steps 3 and 4 to avoid this answer taking forever.