
I want to train a neural network to perform signal classification.

The network has 50 inputs, each in the range [-1, 1]

hidden layers of 50 neurons each (the number of layers is not restricted)

10 outputs

hyperbolic tangent activation (not restricted)

I am restricted to the hnn library for the training.
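
For reference, here is a minimal sketch of this setup in plain numpy (not hnn, whose API I am not assuming here): 50 inputs in [-1, 1], hidden layers of 50 tanh units, and 10 tanh outputs.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Initialize weights/biases for an MLP, e.g. [50, 50, 50, 50, 10]."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x):
    """Forward pass with tanh on every layer, as in the setup above."""
    a = x
    for W, b in params:
        a = np.tanh(a @ W + b)
    return a

# 50 inputs in [-1, 1], e.g. three hidden layers of 50 units, 10 outputs
params = init_mlp([50, 50, 50, 50, 10])
x = np.random.uniform(-1.0, 1.0, size=50)
print(forward(params, x))  # 10 values in (-1, 1), one per class
```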

My problem is that I do not know what the appropriate learning rate and number of training iterations are.

I have tried many settings in the ranges:

[1K - 10K] training iterations

[0.001 - 1.5] learning rate

But when I feed the training data back into the trained network, I get very poor results (as seen in the confusion matrix): at most 2 classes are classified correctly.
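
For completeness, this is a generic numpy sketch of how such a confusion matrix can be computed (the labels here are just placeholder values, not my data):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels: true classes vs. the argmax of the network's 10 outputs
y_true = np.array([0, 1, 2, 2, 5, 9])
y_pred = np.array([0, 1, 2, 3, 5, 1])
cm = confusion_matrix(y_true, y_pred)
print(cm.diagonal())  # per-class correct counts; in my case only a couple
                      # of these diagonal entries end up non-zero
```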

What is an appropriate setting of these two parameters for my input data?

While searching the literature for similar cases, I found that different works use different parameter settings without really explaining the reasoning.
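
One generic way to compare settings systematically is a small grid search against a held-out split. A sketch (train_and_evaluate is a hypothetical stand-in for whatever the training library actually provides, not an hnn call):

```python
import itertools

def grid_search(train_and_evaluate, train_data, val_data):
    """Try each (learning_rate, iterations) pair and keep the one with the
    best validation accuracy. train_and_evaluate is a hypothetical helper
    that trains a fresh network and returns accuracy on val_data."""
    learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0, 1.5]
    iteration_counts = [1_000, 2_000, 5_000, 10_000]
    best_setting, best_accuracy = None, -1.0
    for lr, iters in itertools.product(learning_rates, iteration_counts):
        accuracy = train_and_evaluate(train_data, val_data, lr, iters)
        if accuracy > best_accuracy:
            best_setting, best_accuracy = (lr, iters), accuracy
    return best_setting, best_accuracy
```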


Experiments: The mentioned library has a function trainUntilErrorBelow (self-explanatory). I have used this function to see how fast I can reach a given error while changing the activation function and the number of hidden layers (a generic sketch of such a timing loop appears below, after the results).

I have chosen the following:

minimum error: 300

learning rate: 0.01

Results: Hyperbolic tangent:

1 hidden layer (50 neurons) - 32.12 sec

2 hidden layers (50/50 neurons) - 31.51 sec

3 hidden layers (50/50/50 neurons) - 12.18 sec

4 hidden layers (50/50/50/50 neurons) - 42.28 sec

Sigmoid:

1 hidden layer (50 neurons) - 21.32 sec

2 hidden layers (50/50 neurons) - 274.29 sec

3 hidden layers (50/50/50 neurons) - ∞ sec

4 hidden layers (50/50/50/50 neurons) - ∞ sec

Is it reasonable to assume that the hyperbolic tangent activation function with 3 hidden layers (50/50/50 neurons) is a good choice for the network architecture?
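
For reference, the timing experiment above roughly corresponds to a loop like the following generic sketch (train_one_epoch and total_error are hypothetical stand-ins, not hnn functions):

```python
import time

def time_until_error_below(train_one_epoch, total_error, network, data,
                           min_error=300.0, learning_rate=0.01,
                           max_epochs=100_000):
    """Train until the summed error drops below min_error and report the
    wall-clock time, mimicking trainUntilErrorBelow. Returns None for the
    time if the threshold is never reached (the "inf sec" rows above)."""
    start = time.time()
    for _ in range(max_epochs):
        network = train_one_epoch(network, data, learning_rate)
        if total_error(network, data) < min_error:
            return network, time.time() - start
    return network, None
```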

Boris Mocialov
  • In general you need trial-and-error for these hyper-parameters. This is the reason you observe such a huge range for them in the literature. I'm pretty sure the problem in your case is not the learning rate, but the network architecture and maybe your dataset. Since you already tried some learning rates, you should invest some time in regularization; your network might be overfitting all the time. Add some L1/L2 regularization to your weights or even something like a dropout layer (see the sketch after these comments). – sascha Jun 01 '16 at 00:36
  • @sascha thanks, trying to run some experiments to determine appropriate settings for the hyperparameters – Boris Mocialov Jun 02 '16 at 13:06
  • @MocialovBoris As Sascha said, it looks more like a problem with the data being classified. For signal classification, you normally (often ;) ) calculate some features from the signals themselves (mean, Fourier transform, blah-coefficients) and then classify those features. One thing here is which features you take (more art than science); the other, more or less a rule of thumb, is to normalize the data, most commonly to [-1, 1]. What are your 50 inputs? What sort of values do they have? Why do you have 50 output nodes (you mentioned 10 before)? – Luis Jun 02 '16 at 13:32
  • @Luis 50 inputs are extracted features from visual data; 10 outputs - classes. I am not denying that the problem can be with the dataset. I want to pinpoint what exactly is causing poor training. P.S. outputs are 10, 50/50/... are neurons in hidden layers – Boris Mocialov Jun 02 '16 at 13:35
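
Following up on the regularization suggestion in the comments: a generic sketch of a gradient-descent weight update with an L2 penalty (not hnn's API; lambda_ is a hypothetical regularization strength):

```python
def sgd_step_with_l2(W, grad_W, learning_rate=0.01, lambda_=1e-4):
    """One gradient-descent step with L2 weight decay: the penalty
    0.5 * lambda_ * ||W||^2 contributes lambda_ * W to the gradient,
    which shrinks the weights and discourages overfitting."""
    return W - learning_rate * (grad_W + lambda_ * W)
```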

1 Answer


(Was intended as comment, but got too big :P)

I think the most useful tool here would be to look at the learning curves to see whether the weights are actually moving (you should see the training error curve going down). From there, you can play with the parameters. Things to think about: the learning rate may be too large or too small, meaning the weight changes in each iteration are either big or tiny. The former may lead to no convergence at all, the latter to slow convergence. If the weights change too much, you may also skip over good error minima. In any case, the plots will definitely give you hints about what is going on. The same applies to the momentum (if you use it): sometimes a too-large value there makes the weights gain impetus and also miss the minima.
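
A generic sketch of how such a learning curve can be recorded and plotted (train_one_epoch and total_error are hypothetical stand-ins for whatever the training library exposes):

```python
import matplotlib.pyplot as plt

def plot_learning_curve(train_one_epoch, total_error, network, data,
                        epochs=500, learning_rate=0.01):
    """Record the total error after each epoch and plot it; a healthy run
    shows a curve that keeps going down and then flattens out."""
    errors = []
    for _ in range(epochs):
        network = train_one_epoch(network, data, learning_rate)
        errors.append(total_error(network, data))
    plt.plot(errors)
    plt.xlabel("epoch")
    plt.ylabel("total error")
    plt.show()
    return network, errors
```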

Training iterations: I always train for 200 to 500 epochs and have a look at the learning plots. If I decide to go with a particular configuration, I train for lots of epochs (10,000), go get something to eat, and have a look at the plots again to check that nothing weird happened while I was away ;) Most of the time I see little change after epoch 1,000 (at least, the trend keeps going down at the same pace).

Another comment (with all due caution): I don't know your problem, but I have always used only 1 hidden layer and it works. Here I see changes in the number of hidden layers, which is a search problem in itself. For a first try, I mostly go with num_hidden = num_inputs (see the small example below). I would humbly suggest starting with smaller, simpler networks first. ;)
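
As a concrete example of that rule of thumb (just an illustrative configuration, not tied to any particular library):

```python
# First try: one hidden layer with as many hidden units as inputs,
# i.e. 50 inputs -> 50 hidden -> 10 outputs.
num_inputs, num_outputs = 50, 10
num_hidden = num_inputs
layer_sizes = [num_inputs, num_hidden, num_outputs]  # [50, 50, 10]
```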

Luis