
Introduction

I'm very new to Artificial Intelligence, Machine Learning, and Neural Networks.

I tried to code a few things with the help of the FANN (Fast Artificial Neural Network) library (C++) to test the capability of this kind of system.

Programming

I wrote a small piece of code that generates a training file for supervised learning. I had already run some tests, but this one was made to understand the relation between the organization of the hidden layers and the AI's capability to solve the same problem.

To explain my observations, I will use the notation A-B-C-[...]-X to describe a configuration of A input neurons, B neurons in the first hidden layer, C in the second, ..., and X output neurons.

In these tests, the training data was 2k random results of a working NOT function (f(0)=1; f(1)=0), the equivalent of '!' in many languages. Note also that an epoch represents one training pass over all of the training data, and "AI" refers to a trained ANN.

There are no errors in the training data.

You can find the entire source code on my GitHub Repo.
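Here is a simplified sketch of the idea (the full version is in the repo); it assumes FANN's standard training file layout ("number of pairs, number of inputs, number of outputs" on the first line, then alternating input/output lines), and the file name is just illustrative:

// Generate a FANN training file for the NOT function.
#include <cstdlib>
#include <fstream>

int main()
{
    const unsigned int numPairs = 2000;       // "2k random results"
    std::ofstream file("not_function.data");  // illustrative file name

    file << numPairs << " 1 1\n";             // 1 input, 1 output per pair
    for (unsigned int i = 0; i < numPairs; ++i)
    {
        int input = std::rand() % 2;                    // random 0 or 1
        file << input << "\n" << (1 - input) << "\n";   // NOT(input)
    }
    return 0;
}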

More is not better

First, I noticed that a 1-1-1 system is more capable after 37 epochs than a 1-[50 layers of 5 neurons]-1 system is after 20k epochs (an error rate of 0.0001 against 0.25).

My first thought was that the second AI needed more training, because there are a lot more weights over which to minimize the cost, but I'm not sure this is the only reason.
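For reference, the two configurations can be built like this with the FANN C API (a simplified sketch, not my exact code; the training parameters and the file name are illustrative):

#include <floatfann.h>

int main()
{
    // 1-1-1 : three layers in total
    unsigned int small[] = {1, 1, 1};
    struct fann *tiny = fann_create_standard_array(3, small);

    // 1-[50 layers of 5 neurons]-1 : 52 layers in total
    unsigned int big[52];
    big[0] = 1;
    big[51] = 1;
    for (int i = 1; i <= 50; ++i)
        big[i] = 5;
    struct fann *deep = fann_create_standard_array(52, big);

    // Same training data for both; max epochs, report interval and
    // desired error mirror the numbers quoted above.
    fann_train_on_file(tiny, "not_function.data", 37, 10, 0.0001f);
    fann_train_on_file(deep, "not_function.data", 20000, 1000, 0.0001f);

    fann_destroy(tiny);
    fann_destroy(deep);
    return 0;
}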

This led me to run some tests with the same total number of neurons.

Equal is not equal

The 1-2-2-1 configuration seems more efficient than the 1-4-1.

When I ran a test on those two configurations, I got the following outputs (from a testing program I wrote myself). These are two separate tests; "9**" is the current index of the test.

The test consists of giving a random integer, either 0 or 1, to the AI and printing the output. Each test was run separately.
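The testing loop is roughly this (simplified; it assumes the trained network was saved with fann_save() beforehand, and the file name is illustrative):

#include <cstdio>
#include <cstdlib>
#include <floatfann.h>

int main()
{
    // Load a previously trained and saved network.
    struct fann *ann = fann_create_from_file("not_1_2_2_1.net");

    for (int i = 0; i < 1000; ++i)
    {
        fann_type input[1];
        input[0] = (fann_type)(std::rand() % 2);   // random 0 or 1
        fann_type *output = fann_run(ann, input);
        std::printf("[%d]Number : %f, output : %f\n",
                    i, (double)input[0], (double)output[0]);
    }

    fann_destroy(ann);
    return 0;
}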

// 1-2-2-1
[936]Number : 0.000000, output : 1.000000
[937]Number : 1.000000, output : 0.009162
[938]Number : 0.000000, output : 1.000000
[939]Number : 0.000000, output : 1.000000
[940]Number : 1.000000, output : 0.009162
[941]Number : 0.000000, output : 1.000000
[942]Number : 0.000000, output : 1.000000

// 1-4-1
[936]Number : 0.000000, output : 1.000000
[937]Number : 0.000000, output : 1.000000
[938]Number : 1.000000, output : 0.024513
[939]Number : 0.000000, output : 1.000000
[940]Number : 0.000000, output : 1.000000
[941]Number : 1.000000, output : 0.024513
[942]Number : 1.000000, output : 0.024513

Notice that the first configuration gives a result closer to 0 than the second one (0.009162 against 0.024513). This is not an IEEE floating-point issue, and those two values don't change if I run another test.

What is the reason for that? Let's try to figure it out.

  • How many "synapses" do we have in the first configuration?

first

first[0]->second[0]
first[0]->second[1]

then

second[0]->third[0]
second[0]->third[1]
second[1]->third[0]
second[1]->third[1]

final

third[0]->fourth[0]
third[1]->fourth[0]

So we get a total of 2 + 4 + 2 = 8 synapses (and therefore 8 different weights).

  • What about the second configuration?

first

first[0]->second[0]
first[0]->second[1]
first[0]->second[2]
first[0]->second[3]

final

second[0]->third[0]
second[1]->third[0]
second[2]->third[0]
second[3]->third[0]

So we get a total of 4 + 4 = 8 synapses (still 8 different weights).

And in both systems we have 4 activation functions (one for each hidden neuron).
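A quick way to double-check this counting (note that FANN also adds a bias neuron to every layer except the output one, so the number of weights it actually reports is a bit higher than this):

#include <cstdio>
#include <vector>

// Number of connections in a fully connected feed-forward net:
// the product of the sizes of each pair of consecutive layers.
unsigned int countSynapses(const std::vector<unsigned int> &layers)
{
    unsigned int total = 0;
    for (std::size_t i = 0; i + 1 < layers.size(); ++i)
        total += layers[i] * layers[i + 1];
    return total;
}

int main()
{
    std::printf("1-2-2-1 : %u synapses\n", countSynapses({1, 2, 2, 1})); // 2 + 4 + 2 = 8
    std::printf("1-4-1   : %u synapses\n", countSynapses({1, 4, 1}));    // 4 + 4 = 8
    return 0;
}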

How can we get a significant difference in capability with the same properties?

snwfdhmp
    This question is not a very good fit for StackOverflow, you may want to try [Cross Validated](http://stats.stackexchange.com/questions/tagged/neural-networks?sort=votes&pageSize=50) instead. – Ivan Aksamentov - Drop Dec 30 '16 at 16:33
  • It is hard to tell. Neural networks are complex beasts. They have a lot of knobs and buttons ("hyperparameters") to tweak and hack. They are also known in machine learning community as "black boxes", the kind of models that do not explain themselves (either why they work or why they don't or why one set of hyperparameters is better than others). There is still [a lot of research](https://arxiv.org/list/cs.NE/recent) going on. In order to train good models and understand them better you will need good tools, such as modern NN frameworks: TensorFlow, Theano, Torch etc. – Ivan Aksamentov - Drop Dec 30 '16 at 16:39

1 Answer


In general, having a lot of nodes and weights can lead to the neural network becoming overspecialised. As a somewhat extreme example: if you have a few thousand images, a neural network with billions of nodes (and many more weights) risks learning every single pixel of your training data instead of working out the concepts "eye", "ears", ... that compose a "face". So when you present that overspecialised neural network with different images, it won't work on them (or at least not that well). It hasn't worked out the abstract concepts (e.g. "a cat has ears and eyes" while "a house has windows").

Whilst in your tests there isn't much to overspecialise in, you might still see some (minor) effect of this.

Same properties: the number of weights is the same, but the structure is different. A neural network consisting of a straight line of nodes (1-1-1-...-1) will behave quite differently from a more compact one (1-20-1). A 1-1-1-...-1 network might not even be able to learn what a 1-20-1 network can learn (there are some rules on the number of nodes/weights you need to learn boolean algebra, though I don't recall them).

I suspect that a 1-4-1 network shows more deviation from the expected result because each intermediate node is affected by more weights: the more weights there are to get right per node, the longer training will take.

In a 1-2-2-1 network, the first intermediate layer has only a single input weight per node, the second intermediate layer has 2 weights per node, and the output layer has two weights per node. So at most you can "wiggle around" two values per intermediate node.

I don't know the details of your neural network (the function to which the weights are applied), but if you think about the following example, it might make things clearer:

  • let's assume the function is f(weight, input) = input*weight + constant
  • further, our network is 1-1 (i.e. it has one weight to determine)

If that one weight is -1 and the constant is 1, you have your negation function. That neural network would beat any bigger network in both training speed and accuracy. Any network with more nodes (and thus more weights) has to work out the delicate balance of all its weights until it finds one that represents the concept "negation"; probably that network would end up with a lot of zeros in it (i.e. "ignore that input") and one path that does the negation.
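As a tiny worked version of that example (with the assumed f(weight, input) = input*weight + constant, weight = -1 and constant = 1), the single-weight network reproduces NOT exactly:

#include <cstdio>

// f(weight, input) = input * weight + constant
double f(double weight, double input, double constant)
{
    return input * weight + constant;
}

int main()
{
    std::printf("f(0) = %f\n", f(-1.0, 0.0, 1.0)); // prints 1.000000
    std::printf("f(1) = %f\n", f(-1.0, 1.0, 1.0)); // prints 0.000000
    return 0;
}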

As food for further work: neural networks are good with fuzzy data and less good at getting algebraic functions right down to the 10th digit.

Ray