I've been trying to learn the math behind neural networks and have implemented (in Octave) a version of the following equations, which include bias terms.

Back-propagation equations in matrix form:

[image: back-propagation equations in matrix form]
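
The image of the equations is not reproduced in this text. Matching the code below, they presumably correspond to standard back-propagation for a two-layer sigmoid network with squared error $E = \tfrac{1}{2}\lVert x_2 - t\rVert^2$:

$$
\begin{aligned}
x_1 &= \sigma(W_1 x_0 + b_1), \qquad x_2 = \sigma(W_2 x_1 + b_2) \\
\delta_2 &= (x_2 - t) \odot \sigma'(W_2 x_1 + b_2) \\
\delta_1 &= \left(W_2^{\top} \delta_2\right) \odot \sigma'(W_1 x_0 + b_1) \\
\frac{\partial E}{\partial W_2} &= \delta_2\, x_1^{\top}, \qquad
\frac{\partial E}{\partial b_2} = \delta_2, \qquad
\frac{\partial E}{\partial W_1} = \delta_1\, x_0^{\top}, \qquad
\frac{\partial E}{\partial b_1} = \delta_1
\end{aligned}
$$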

Visual representation of the problem and network:

[image: the problem data and the network architecture]

clear;  clc;  close all;
# Initialize the weights and biases from the input layer to the hidden layer
W1 = rand(3,4);
b1 = ones(3,1);
# Initialize the weights and biases from the hidden layer to the output layer
W2 = rand(2,3);
b2 = ones(2,1);

# Define the sigmoid activation and its derivative
s = @(z) 1./(1 + exp(-z));
ds = @(z) s(z).*(1-s(z));

data = csvread("data.txt");

for j = 1 : 100                  # epochs
  for i = 1 : size(data,1)       # loop over the training examples
      x0 = data(i,2:5)';

      # Build the target vector (column 6 of the data holds the isStairs label)
      if data(i,6) == 1
        t = [1;0];
      else
        t = [0;1];
      end

      # Forward propagate
      x1 = s(W1*x0 + b1);
      x2 = s(W2*x1 + b2);

      # Record and display the squared error for this update
      iter = (j-1)*size(data,1) + i;
      E(iter) = norm(x2-t)^2;
      E(end)

      # Back-propagate the error
      delta2 = (x2-t).*ds(W2*x1+b2);
      delta1 = (W2'*delta2).*ds(W1*x0+b1);

      # Gradients of the error with respect to the weights
      dedw2 = delta2*x1';
      dedw1 = delta1*x0';

      # Gradient-descent update with a linearly decaying learning rate
      alpha = 0.001*(40000-iter)/40000;
      W2 = W2 - alpha*dedw2;
      W1 = W1 - alpha*dedw1;
      b2 = b2 - alpha*delta2;
      b1 = b1 - alpha*delta1;
  end
end
plot(E)
title('Gradient Descent')
xlabel('Iteration')
ylabel('Error')

When I run this, I converge on weights that give a constant error of about 0.5 rather than 0.0. Depending on the initial random samples of W1 and W2, the error plot looks something like this:

[image: error vs. iteration, flattening out at roughly 0.5]

The resulting weights W1 and W2 yield an output of roughly [0.5, 0.5] for the whole set rather than [1,0] (isStairs = true) or [0,1] (isStairs = false).
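
For reference, this can be checked after training by pushing the whole set through the forward pass at once. This is only a sketch: it reuses the variables from the script, assumes the same data.txt layout, and relies on Octave broadcasting the bias columns:

# Evaluate the trained network on every example (sketch; assumes the script above has been run)
X = data(:, 2:5)';              # 4 x N matrix of inputs
H = s(W1*X + b1);               # hidden activations; b1 is broadcast across the columns
Y = s(W2*H + b2);               # 2 x N matrix of outputs; each column should approach [1;0] or [0;1]
[~, predicted] = max(Y);        # index of the larger output: 1 = isStairs, 2 = not stairs
truth = 2 - (data(:,6) == 1)';  # the label column mapped to the same 1/2 coding
accuracy = mean(predicted == truth)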

Other information:

  • If I loop over a single data point instead of the entire learning set, it does converge to zero error for that particular case (after 20 iterations or so), so I assume my derivatives are correct? (See the gradient-check sketch after this list.)
  • For the model to converge, the learning rate has to be insanely small. I'm not sure what this means.
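
To back up the first point, the derivatives can be verified with a numerical gradient check. The sketch below reuses s, ds, W1, W2, b1 and b2 from the script above; x0, t and the helper f are arbitrary test values made up for the check (the loop's delta2 corresponds to the error 0.5*norm(x2-t)^2):

# Numerical gradient check for one example (sketch)
x0 = rand(4,1);  t = [1;0];
h  = 1e-6;

# Analytic gradient w.r.t. W2, using the same formulas as the training loop
x1 = s(W1*x0 + b1);
x2 = s(W2*x1 + b2);
delta2 = (x2-t).*ds(W2*x1+b2);
dedw2  = delta2*x1';

# Central-difference derivative w.r.t. a single entry, e.g. W2(1,1)
f   = @(W) 0.5*norm(s(W*s(W1*x0+b1) + b2) - t)^2;
W2p = W2;  W2p(1,1) = W2p(1,1) + h;
W2m = W2;  W2m(1,1) = W2m(1,1) - h;
numeric = (f(W2p) - f(W2m)) / (2*h);

# The two values should agree to several decimal places
[numeric, dedw2(1,1)]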

Is this neural network valid to solve the described problem? If so, what does it mean to converge to an error of 0.5?


1 Answer


The NN learns from data. If there is only one example, it will learn that example by heart and you get zero error. But if you have more examples, they will likely not lie on a nice curve; they will be noisy instead. So it is harder for the network to learn the data by heart (it also depends on the number of free parameters the NN has, but you get the idea). However, you don't want the NN to learn everything in detail anyway. You want it to learn the overall trend, not the noise. But this also means that your error won't converge to zero, since there is noise that your NN should not learn. So don't worry if you have a (small) error at the end.

But what about the learning rate? Well, imagine you have 10 examples. Eight of them describe a perfect line, but two exhibit noise: one slightly to the right (let's say +1) and the other slightly to the left (-1). If the NN evaluates one of those points and updates to minimize the error drawn from it, the update will jump from + to - or vice versa. Depending on your learning rate, this jumping may eventually converge to the middle (which is the correct function) or may go on forever. This is essentially what the learning rate does: it determines how much impact an estimation error has on the update/learning of the network. So a good idea is to choose a larger learning rate at the beginning (where the network performs really badly due to its random initialization) and decrease the rate once it has already learned something. You can achieve the same thing with a small learning rate, but it will take longer ;)
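
In code, such a schedule just means shrinking the learning rate as training progresses. A minimal sketch in Octave (alpha0 and decay are made-up values, not taken from the question's script):

# Learning rate that starts large and decays over the run (sketch)
alpha0 = 0.1;        # relatively large while the network is still far off
decay  = 0.001;      # controls how quickly the rate shrinks
for iter = 1:40000
  alpha = alpha0 / (1 + decay*iter);   # big steps early, small steps late
  # ... compute the gradients and update W1, W2, b1, b2 with this alpha ...
end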
