
I'm fairly new to neural networks and I'm doing my own "Hello World" with LSTMs instead of copying something. I have chosen a simple piece of logic as follows:

The input has 3 timesteps. The first one is either 1 or 0, the other 2 are random numbers. The expected output is the same as the first timestep of the input. The data feed looks like:

_X0=[1,5,9] _Y0=[1] _X1=[0,5,9] _Y1=[0] ... 200 more records like this. 
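For reference, here is a minimal sketch (assuming NumPy; the function name make_data is just for illustration) of how such a dataset could be generated in code, with the shapes the LSTM below expects (samples, 3 timesteps, 1 feature):

import numpy as np

def make_data(n_samples=200):
    # first timestep is the label (0 or 1), the other two timesteps are random numbers
    labels = np.random.randint(0, 2, size=(n_samples, 1))
    noise = np.random.randint(0, 10, size=(n_samples, 2))
    x = np.concatenate([labels, noise], axis=1).astype(float)
    # reshape to (samples, timesteps, features) for the LSTM
    return x.reshape(n_samples, 3, 1), labels

x_, y_ = make_data()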

This simple(?) logic can be trained to 100% accuracy. I ran many tests, and the most efficient model I found was 3 LSTM layers, each with 15 hidden units. This returned 100% accuracy after 22 epochs.

However, I noticed something that I struggle to understand: in the first 12 epochs the model makes no progress at all as measured by accuracy (it stays at 0.5) and only marginal progress as measured by categorical cross-entropy (it goes from 0.69 to 0.65). Then, from epoch 12 through epoch 22, it trains very quickly to an accuracy of 1.0. The question is: why does training happen like this? Why do the first 12 epochs make no progress, and why are epochs 12-22 so much more efficient?

Here is my entire code:

from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.utils.np_utils import to_categorical
import helper

# load the data: 3 timesteps per sample, the target is the first timestep
x_, y_ = helper.rnn_csv_toXY("LSTM_hello.csv", 3, "target")
y_binary = to_categorical(y_)

# 3 stacked LSTM layers with 15 units each, followed by a softmax classifier
model = Sequential()
model.add(LSTM(15, input_shape=(3, 1), return_sequences=True))
model.add(LSTM(15, return_sequences=True))
model.add(LSTM(15, return_sequences=False))
model.add(Dense(2, activation='softmax', kernel_initializer='RandomUniform'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)

1 Answer


It is hard to give a specific answer to this as it depends on many factors. One major factor that comes into play when training neural networks is the learning rate of the optimizer you choose.

In your code you have no specific learning rate set. The default learning rate of Adam in Keras 2.0.3 is 0.001. Adam uses a dynamic learning rate lr_t based on the initial learning rate (0.001) and the current time step, defined as

lr_t = lr * (sqrt(1. - beta_2**t) / (1. - beta_1**t))

The values of beta_2 and beta_1 are commonly left at their default values of 0.999 and 0.9 respectively. If you plot this learning rate you get a picture of something like this:

[Plot: Adam dynamic learning rate for epochs 1 to 22]
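A minimal sketch to reproduce such a plot from the formula above (assuming NumPy and matplotlib; the values lr=0.001, beta_1=0.9, beta_2=0.999 are the Keras defaults for Adam):

import numpy as np
import matplotlib.pyplot as plt

lr, beta_1, beta_2 = 0.001, 0.9, 0.999  # Keras defaults for Adam
t = np.arange(1, 23)                    # steps 1 to 22, as in the plot above
lr_t = lr * (np.sqrt(1. - beta_2**t) / (1. - beta_1**t))

plt.plot(t, lr_t)
plt.xlabel('t')
plt.ylabel('lr_t')
plt.show()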

It might just be that this is the sweet spot for updating your weights to find a local (possibly even a global) minimum. A learning rate that is too high often makes no difference, as it just 'skips' over the regions that would lower your error, whereas a lower learning rate takes smaller steps in your error landscape and lets you find regions where the error is lower.

I suggest that you use an optimizer that makes fewer assumptions, such as stochastic gradient descent (SGD), and test this hypothesis by using a lower learning rate.
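For example, a minimal sketch of how you could swap in SGD with an explicit, fixed learning rate in the model above (the value 0.0005 is just an illustrative choice, not a recommendation):

from keras.optimizers import SGD

# plain SGD with a fixed, explicit learning rate (no adaptive schedule)
sgd = SGD(lr=0.0005)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)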
