I'm training a simple neural network with a single Dense layer on the MNIST dataset in Keras.
This is the code:
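import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense

# x_train, y_train, x_test, y_test: MNIST arrays loaded beforehand (not shown)
model = Sequential()
model.add(Input(shape=(28, 28)))  # 28x28 grayscale images
model.add(Flatten())
model.add(Dense(10, activation='sigmoid'))  # one output per digit class
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=32, epochs=10)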
This is the output when the learning rate is 0.01:
Epoch 1/10
1875/1875 [==============================] - 2s 946us/step - loss: 315.4696 - accuracy: 0.8432 - val_loss: 195.9139 - val_accuracy: 0.8957
Epoch 2/10
1875/1875 [==============================] - 2s 877us/step - loss: 263.0978 - accuracy: 0.8674 - val_loss: 233.7138 - val_accuracy: 0.8782
Epoch 3/10
1875/1875 [==============================] - 2s 889us/step - loss: 251.8907 - accuracy: 0.8730 - val_loss: 208.0299 - val_accuracy: 0.8906
Epoch 4/10
1875/1875 [==============================] - 2s 882us/step - loss: 246.9039 - accuracy: 0.8754 - val_loss: 229.8979 - val_accuracy: 0.8937
Epoch 5/10
1875/1875 [==============================] - 2s 876us/step - loss: 234.6116 - accuracy: 0.8786 - val_loss: 263.7991 - val_accuracy: 0.8682
Epoch 6/10
1875/1875 [==============================] - 2s 942us/step - loss: 239.2780 - accuracy: 0.8781 - val_loss: 217.1707 - val_accuracy: 0.8892
Epoch 7/10
1875/1875 [==============================] - 2s 943us/step - loss: 235.9433 - accuracy: 0.8805 - val_loss: 233.0448 - val_accuracy: 0.8926
Epoch 8/10
1875/1875 [==============================] - 2s 941us/step - loss: 237.9058 - accuracy: 0.8812 - val_loss: 229.1561 - val_accuracy: 0.8912
Epoch 9/10
1875/1875 [==============================] - 2s 888us/step - loss: 235.2525 - accuracy: 0.8826 - val_loss: 318.9307 - val_accuracy: 0.8683
Epoch 10/10
1875/1875 [==============================] - 2s 885us/step - loss: 238.1098 - accuracy: 0.8810 - val_loss: 275.0455 - val_accuracy: 0.8809
And this is the output when the learning rate is 0.03, with all other hyperparameters unchanged:
Epoch 1/10
1875/1875 [==============================] - 2s 1ms/step - loss: 931.7540 - accuracy: 0.8417 - val_loss: 618.5505 - val_accuracy: 0.8952
Epoch 2/10
1875/1875 [==============================] - 2s 945us/step - loss: 767.9313 - accuracy: 0.8701 - val_loss: 618.2877 - val_accuracy: 0.8940
Epoch 3/10
1875/1875 [==============================] - 2s 892us/step - loss: 756.3298 - accuracy: 0.8730 - val_loss: 847.1705 - val_accuracy: 0.8582
Epoch 4/10
1875/1875 [==============================] - 2s 956us/step - loss: 739.8559 - accuracy: 0.8748 - val_loss: 687.9159 - val_accuracy: 0.8901
Epoch 5/10
1875/1875 [==============================] - 2s 888us/step - loss: 731.3071 - accuracy: 0.8760 - val_loss: 693.1130 - val_accuracy: 0.8942
Epoch 6/10
1875/1875 [==============================] - 2s 877us/step - loss: 728.4488 - accuracy: 0.8787 - val_loss: 685.3834 - val_accuracy: 0.8841
Epoch 7/10
1875/1875 [==============================] - 2s 878us/step - loss: 712.8240 - accuracy: 0.8798 - val_loss: 640.9078 - val_accuracy: 0.8972
Epoch 8/10
1875/1875 [==============================] - 2s 890us/step - loss: 693.1299 - accuracy: 0.8811 - val_loss: 657.0080 - val_accuracy: 0.8902
Epoch 9/10
1875/1875 [==============================] - 2s 884us/step - loss: 700.5771 - accuracy: 0.8803 - val_loss: 739.0408 - val_accuracy: 0.8871
Epoch 10/10
1875/1875 [==============================] - 2s 897us/step - loss: 696.2348 - accuracy: 0.8833 - val_loss: 785.1879 - val_accuracy: 0.8762
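I ran this multiple times, so it isn't a random fluctuation. I also tried RMSprop and got the same results.
From my understanding, the decrease in the loss should be proportional to the learning rate, not the loss value itself. Yet here the loss ends up roughly three times larger (about 700 versus about 240) when the learning rate is three times larger.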
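To make that expectation concrete, here is a minimal sketch on a toy one-dimensional quadratic, not the MNIST model. The functions `loss`, `grad`, and `first_step_decrease` are hypothetical names I made up for illustration. It only shows that, for a single plain gradient descent step from the same starting point, the drop in the loss scales roughly with the learning rate:

```python
def loss(w):
    # Toy quadratic loss with its minimum at w = 3 (illustrative only).
    return 0.5 * (w - 3.0) ** 2


def grad(w):
    # Derivative of the toy loss with respect to w.
    return w - 3.0


def first_step_decrease(lr, w0=0.0):
    # One plain gradient descent step: w1 = w0 - lr * grad(w0).
    w1 = w0 - lr * grad(w0)
    # How much the loss dropped after that single step.
    return loss(w0) - loss(w1)


d_small = first_step_decrease(0.01)
d_large = first_step_decrease(0.03)
ratio = d_large / d_small

print("decrease with lr=0.01:", d_small)
print("decrease with lr=0.03:", d_large)
print("ratio:", ratio)  # close to 3, slightly below it
```

In this toy case the decrease per step is close to three times larger for the larger learning rate, but the loss value itself stays on the same scale, which is what I expected to see in the Keras logs as well.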
Is this related to how Keras calculates the loss function somehow?
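For reference, this is how I understand the textbook definition of categorical cross-entropy, written as a small sketch in plain Python. It assumes one-hot targets and predictions that are already probabilities; `categorical_crossentropy` and `mean_loss` here are my own hypothetical helpers, not the Keras functions, and the actual Keras implementation may apply extra steps (such as clipping or rescaling the predictions) that are not shown:

```python
import math


def categorical_crossentropy(y_true, y_pred):
    # Textbook per-sample cross-entropy: -sum(y_true * log(y_pred)).
    # Terms where the target is 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def mean_loss(batch_true, batch_pred):
    # Average of the per-sample losses over a batch.
    losses = [categorical_crossentropy(t, p) for t, p in zip(batch_true, batch_pred)]
    return sum(losses) / len(losses)


# Two made-up samples with three classes each (illustrative values only).
batch_true = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
batch_pred = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]

sample_loss = categorical_crossentropy(batch_true[0], batch_pred[0])
batch_loss = mean_loss(batch_true, batch_pred)

print("first sample loss:", sample_loss)
print("mean batch loss:", batch_loss)
```

Nothing in this definition involves the learning rate directly, which is why the scaling in the logs above surprises me.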