
I wrote a neural network. It's mostly based (with bug fixes) on the neural nets from James McCaffrey: https://visualstudiomagazine.com/articles/2015/04/01/back-propagation-using-c.aspx. I came across various Git projects and books using his code, and since he worked for Microsoft Research I assumed his work would be good; maybe not top of the bill (it doesn't run on top of CUDA or anything like that), but it's code that I can read, although I'm not into the science side of it. His sample worked on a dataset much like my problem.

My goal was to solve an image classification problem (a dataset based on pixel info). The problem wasn't easy to recreate, but I managed to create a dataset of 50 good scenarios and 50 bad scenarios. When I plotted the measurements in a scatter diagram, the two sets had a lot of fuzzy boundary overlap; I was unable to make anything out of it myself, it was too fuzzy for me. Since I had 5 inputs per sample, I wondered if a neural net might be able to find the inner relations and solve my fuzzy classification problem.

And well, so it did... well, I kind of guess.
Depending on the seeding of the weights (I got to 80% at one point), the number of nodes, and the training time, I get training scores of around 85% to 90%, and lately 95%.

First I played with the random initialization of the weights. Then I played with the number of nodes. Then I played with the learn rate, momentum, and weight decay. They went from (scoring 85% to 90%):

// as in the example code I used
int maxEpochs = 100000;
double learnRate = 0.05;
double momentum = 0.01;
double weightDecay = 0.0001;

to (scoring 95%):

int maxEpochs = 100000;
double learnRate = 0.02;  //had a huge effect
double momentum = 0.01;
double weightDecay = 0.001; //had a huge effect
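
As far as I understand it, these three constants enter the weight update roughly like this. This is just a minimal sketch of the textbook form, not a claim about McCaffrey's exact code (implementations differ, e.g. in whether the decay term is also scaled by the learning rate):

// Sketch of one stochastic-gradient-descent update for a single weight.
// "gradient" is dError/dWeight for that weight on the current sample.
static double UpdateWeight(double weight, double gradient, ref double prevDelta,
                           double learnRate, double momentum, double weightDecay)
{
    double delta = -learnRate * gradient      // step against the error gradient
                   + momentum * prevDelta;    // plus a fraction of the previous step
    weight += delta;
    weight -= weightDecay * weight;           // weight decay: shrink toward zero (acts like L2 regularization)
    prevDelta = delta;
    return weight;
}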

I'm a bit surprised that the number of nodes had less effect than changing the random initialization of the net and changing the constants above.

However, it makes me wonder:

  • As a general rule of thumb, is 95% a high score? (I'm not sure where the limits are, but I think it also depends on the dataset. While I am amazed by 95%, I wonder if it would be possible to tweak it to 97%.)
  • Should I try to minimize the number of hidden nodes? Currently it's a 5:9:3 network, but I once got a similar score with a 5:6:3 network.
  • Is it normal for a neural network's score to depend so much on the initial random weights (a different start seed)? I thought the training would overcome the starting situation.
Peter

1 Answer


First, sorry if I didn't understand correctly, but it looks like you have 100 training examples and no validation / test set. This is rather small for a training set, which makes it easy for the NN to overtrain on it. You also seem to have chosen a small NN, so maybe you actually don't overfit. The best way to check would be to have a test set.
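
For example, here is a rough sketch of how you could hold out a test set, assuming your data is an array of rows; the method and variable names are illustrative, not taken from your code:

using System;
using System.Linq;

static class DataSplit
{
    // Shuffle the row indices and keep ~testFraction of the rows aside as a
    // test set that is never touched during training.
    public static void SplitTrainTest(double[][] allData, double testFraction, int seed,
                                      out double[][] trainData, out double[][] testData)
    {
        var rnd = new Random(seed);
        int[] indices = Enumerable.Range(0, allData.Length)
                                  .OrderBy(_ => rnd.Next())   // random permutation of the rows
                                  .ToArray();

        int numTest = (int)(allData.Length * testFraction);
        testData  = indices.Take(numTest).Select(i => allData[i]).ToArray();
        trainData = indices.Skip(numTest).Select(i => allData[i]).ToArray();
    }
}

With 100 rows and testFraction = 0.2 you would train on 80 rows and report accuracy on the 20 held-out ones; if training accuracy is far above test accuracy, you are overfitting.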

As to your questions:

  • What a "good score" is depends entirely on your problem. For instance, on MNIST (a widely used digit-recognition dataset) this would be considered quite bad; the best scores are above 99.7% (and it's not too hard to get 99% with a ConvNet). On ImageNet, for instance, it would be awesome. A good way to know whether you're doing well is to compare to human performance somehow. Reaching it is usually hard, so being a bit below it is good, above it is very good, and far below it is bad. Again, this is subjective and depends on your problem.

  • You should definitely try to minimize the number of hidden nodes, following Occam's razor: among several models, the simplest is the best one. It has 2 main advantages: it will run faster, and it will generalize better (if two models perform similarly on your training set, the simplest one is most likely to work better on a new test set).

  • The initialization is known to change the result a lot. However, the big differences are rather between the different initialization methods: constant / simple random (widely used, usually a (truncated) normal distribution) / cleverer random (Xavier initialization, for instance) / "cleverest" initializations (pre-computed features, etc.; harder to use). Between two random initializations generated exactly the same way, the difference in performance should not be that big. My guess is that in some cases you just did not train long enough (the time needed to train properly can change a lot depending on the initialization). My other guess is that the small size of your dataset and network makes the evaluation more dependent on the initial weights than it usually is. A small sketch of the two flavours of random initialization follows below.
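
To make the difference concrete, here is a rough sketch of a simple normal initialization next to a Xavier-style one (the scaling shown is the common Glorot/Xavier variant; the exact constant varies between papers and libraries):

using System;

static class InitSketch
{
    // Simple random init: every weight ~ N(0, 0.01), independent of layer sizes.
    public static double[][] SimpleRandom(int nIn, int nOut, Random rnd)
    {
        var w = new double[nIn][];
        for (int i = 0; i < nIn; i++)
        {
            w[i] = new double[nOut];
            for (int j = 0; j < nOut; j++)
                w[i][j] = 0.01 * Gaussian(rnd);
        }
        return w;
    }

    // Xavier/Glorot-style init: the scale depends on fan-in and fan-out, so the
    // signal variance stays roughly constant from layer to layer.
    public static double[][] Xavier(int nIn, int nOut, Random rnd)
    {
        double stdDev = Math.Sqrt(2.0 / (nIn + nOut));
        var w = new double[nIn][];
        for (int i = 0; i < nIn; i++)
        {
            w[i] = new double[nOut];
            for (int j = 0; j < nOut; j++)
                w[i][j] = stdDev * Gaussian(rnd);
        }
        return w;
    }

    // Box-Muller transform: two uniform samples -> one standard-normal sample.
    static double Gaussian(Random rnd)
    {
        double u1 = 1.0 - rnd.NextDouble();   // avoid log(0)
        double u2 = rnd.NextDouble();
        return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    }
}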

It is normal that the learning rate and weight decay change the result a lot; however, finding the optimal values for them efficiently can be hard.
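
One workable, if brute-force, way to look for them is a small grid search scored on the validation set rather than the training set. Here is a rough sketch; trainAndScore is a hypothetical stand-in for your own training code (it should train a fresh 5:numHidden:3 net with the given settings and return its accuracy on the held-out data):

using System;

static class Sweep
{
    // Brute-force grid search over hidden-layer size, learn rate and weight decay.
    // trainAndScore is supplied by the caller: it trains a fresh net with the
    // given settings and returns its accuracy on the *validation* data.
    public static (int hidden, double lr, double decay, double score) GridSearch(
        Func<int, double, double, double> trainAndScore)
    {
        var best = (hidden: 0, lr: 0.0, decay: 0.0, score: double.MinValue);

        foreach (int numHidden in new[] { 4, 6, 9 })
        foreach (double learnRate in new[] { 0.01, 0.02, 0.05 })
        foreach (double weightDecay in new[] { 0.0001, 0.001, 0.01 })
        {
            double score = trainAndScore(numHidden, learnRate, weightDecay);
            if (score > best.score)
                best = (numHidden, learnRate, weightDecay, score);
        }
        return best;
    }
}

With only ~100 samples each run is cheap, so trying a few dozen combinations is affordable, and it also shows how few hidden nodes you can get away with; just make sure the score you maximize comes from held-out data, otherwise the search itself will overfit.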

gdelab
  • Well, actually the code generates a validation set out of the training set, 80% train and 20% validation (by random separation); I should have mentioned it. If the set is too small, I will try to automate the data harvesting so I can more easily generate a larger training set. I'm not sure how large a dataset should be. You mention "overfitting"; I suppose you mean that a neural net should not memorize (i.e. contain) the data, but should be processing / sorting it. Is there maybe some rule of thumb for the number of nodes, or the size of the dataset, or how to overcome overfitting? – Peter May 23 '17 at 09:27
  • I'm afraid there is no clear rule to avoid overfitting. At any rate, you should avoid having many more free variables (weights) than input values (n_samples * size_of_a_sample), but it can occur way before that. You can spot it if your training accuracy is much better than your validation / test accuracy (it will almost always be a bit better). You can avoid it by using some regularization techniques: batch normalization, L1 or L2 loss regularization, dropout, etc.; but at some point you'll have to reduce your network size to prevent it. – gdelab May 23 '17 at 10:15
  • As for the dataset, bigger is always better, and again it depends a lot on your problem and on your NN size... But what I mean by "the small size of your dataset and network makes the evaluation more dependent on initial weights" is not only that the *network* depends more than usual on initialization, but also the *evaluation* itself: with so few test samples, the variance of any evaluation is quite high, so even very similar NNs can give different evaluation results (just like a survey of 100 people is extremely inaccurate). – gdelab May 23 '17 at 10:20
  • I just read somewhere that the square root of (input nodes × output nodes) should be enough as a general rule... that's just 4 in my current case, and amazingly I still get 95%. But I will put coding time into dataset retrieval. What do you think of such a rule, sqrt(inputs * outputs) = hidden nodes? – Peter May 23 '17 at 10:34