
I have a model that learns to classify (binary classification) at almost 100% accuracy after 7-14 epochs, but after reaching the minimum loss of 0.0004, in the next epoch the loss jumps to as much as 7.5 (which means it has a 50% chance of classifying correctly, the same as pure chance) and then stays near 7 for all subsequent epochs.

I use the Adam optimiser, which should take care of the learning rate.

How can I prevent the training loss from increasing?

This huge jump doesn't happen with the SGD optimiser.

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(X_train.shape[1],))
Dx = Dense(32, activation="relu")(inputs)
Dx = Dense(32, activation="relu")(Dx)
# 20 more hidden layers -> a very deep fully connected network
for i in range(20):
    Dx = Dense(32, activation="relu")(Dx)
Dx = Dense(1, activation="sigmoid")(Dx)
D = Model(inputs=[inputs], outputs=[Dx])
D.compile(loss="binary_crossentropy", optimizer="adam")

D.fit(X_train, y_train, epochs=20)
Ioannis Nasios
user7867665

2 Answers


Your network is quite deep for a fully connected architecture. Most likely you have been hit by vanishing or exploding gradients, i.e. numerical problems caused by repeatedly multiplying very small or very large numbers. I'd recommend a shallower but wider network; with dense layers, something like 2-3 layers is often enough in my experience. If you prefer working with the deeper architecture, you could try something like skip connections, as in the sketch below.
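A minimal sketch of what dense skip connections could look like in Keras, assuming the same 32-unit hidden width and the X_train/y_train arrays from the question (the number of residual blocks here is only illustrative):

from keras.layers import Input, Dense, Add
from keras.models import Model

inputs = Input(shape=(X_train.shape[1],))
x = Dense(32, activation="relu")(inputs)
# Each residual block computes F(x) and adds the block input back on: y = F(x) + x
for _ in range(5):
    h = Dense(32, activation="relu")(x)
    h = Dense(32, activation="relu")(h)
    x = Add()([x, h])  # skip connection keeps a short gradient path through the depth
outputs = Dense(1, activation="sigmoid")(x)
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(X_train, y_train, epochs=20)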

Tapio
  • I am working on implementing a residual learning approach for a standard dense architecture and binary classification. I find that it trains well for the first 5-10 epochs, then both training and validation loss suddenly explode by 10-15 orders of magnitude. I think this might be because I haven't found a way to make skip connections across bottlenecks. I'm thinking some sort of average pooling to get the dimensions down. Do you have any suggestions? – rocksNwaves Oct 08 '20 at 01:00
  • Hmm, interesting question. So by bottlenecks do you mean that you are doing an autoencoder-style architecture but with fully connected layers? For sure skip connections are more established in convolutional architectures; skipping all the O(n²) connections will lead to a spaghetti network really fast. Maybe repeat the inputs again deeper in the network? Or just skip out one connection per neuron and hope it learns a 'signal bridge'? I've used soft attention to gate skip connections previously; it anecdotally helps with the dimensionality a bit, but I'm guessing it would get swamped here. – Tapio Oct 08 '20 at 11:10
  • One alternative path with similar goals might be to use SELUs to re-normalize the activations, which might help with the depth a bit (https://arxiv.org/pdf/1706.02515.pdf); see the sketch after these comments. Also, I don't know your background, but if you haven't done a lot of work with neural networks, it's pretty common that stuff just does not work or training diverges for unknown reasons. Also note that you always need at least an order of magnitude more data than in articles; I wouldn't even try to train a NN on a dataset with fewer than ~10^3 unique data points. – Tapio Oct 08 '20 at 11:14
  • Yes, essentially. Any time the number of neurons in the next layer is less than in the previous one is what I mean by a bottleneck. But based on what you are saying, perhaps I entirely misunderstood the idea of shortcut connections. When reading the ResNet paper, I understood the process to be a simple addition of the input of a block to that same block's outputs: y = F(x) + x. I wasn't really doing that connection by connection; I was simply doing it as a straight-across addition of inputs and outputs. As far as data goes, I have a ton of it. – rocksNwaves Oct 08 '20 at 15:00
  • You are completely right. Sorry, I got confused and started thinking about wiring fully connected layers to each other for some reason. Your interpretation was right, you just add the activations. I've never dropped the dimensionality of a FCNN. I think you could just drop every second weight and let the network learn the paths. In CNNs it's typical to use a pointwise convolution for a learnable downsampling; there's no rule preventing you from doing a convolutional downsample. However, my guess is that the problem is somewhere else, e.g. class imbalance or label noise or some such. – Tapio Oct 12 '20 at 17:43
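Regarding the SELU suggestion in the comments above, a minimal sketch of a self-normalizing dense network in Keras, assuming the same X_train setup as the question ("selu" and "lecun_normal" are the standard Keras names; the depth and width are only illustrative):

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(X_train.shape[1],))
x = inputs
# SELU activations with LeCun-normal initialization keep the activations
# approximately zero-mean / unit-variance through the depth
for _ in range(10):
    x = Dense(32, activation="selu", kernel_initializer="lecun_normal")(x)
outputs = Dense(1, activation="sigmoid")(x)
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam")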

This might come from a small batch size. You may try to increase the batch size, referring to this.
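For reference, Keras controls this through the batch_size argument of fit (the default is 32), so a larger batch for the question's model might look like the following (256 is just an example value):

D.fit(X_train, y_train, epochs=20, batch_size=256)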

devDan
MarcusM