I've been trying to learn the math behind neural networks and have implemented (in Octave) a version of the following equations which include bias terms.
Back-propagation equations matrix form:
Visual representation of the problem and Network:
clear; clc; close all;
#Initialize weights and bias from input to hidden layer
W1 = rand(3,4)
b1 = ones(3,1)
#Initialize weights from hidden to output
W2 = rand(2,3)
b2 = ones(2,1)
#define sigmoid function
s = @(z) 1./(1 + exp(-z));
ds = @(z) s(z).*(1-s(z));
data = csvread("data.txt");
for j = 1 : 100
for i = 1 : length(data)
x0 = data(i,2:5)';
#Find the truth
if data(i,6) == 1 ;
t = [1;0] ;
else
t = [0;1];
end
#Forward propagate
x1 = s(W1*x0 + b1);
x2 = s(W2*x1 + b2);
iter = (j-1)*length(data) + i;
E((j-1)*length(data) + i) = norm(x2-t)^2;
E(length(E))
#Back propagate
delta2 = (x2-t).*ds(W2*x1+b2);
delta1 = W2'*delta2.*ds(W1*x0+b1);
dedw2 = delta2*x1';
dedw1 = delta1*x0';
alpha = 0.001*(40000-iter)/40000;
W2 = W2 - alpha*dedw2;
W1 = W1 - alpha*dedw1;
b2 = b2 - alpha*delta2;
b1 = b1 - alpha*delta1;
end
end
plot(E)
title('Gradient Descent')
xlabel('Iteration')
ylabel('Error')
When I run this, I converge on weights that give an constant error of 0.5 rather than 0.0. The error plot looks something like this depending on the initial samples of W1
and W2
:
The resulting weights W1
and W2
yield output ~[0.5,0.5] for the whole set rather than [1,0](isStairs = true) or [0,1](isStairs = False)
Other information:
- If I loop over a single data point instead of the entire learning set, it does converge to zero error for that particular case. (like 20 iterations or so), so I assume my derivatives are correct?
- For the model to converge the learning rate has to be insanely small. Not sure what this means.
Is this neural network valid to solve the described problem? If so, what does it mean to converge to an error of 0.5?