
I am relatively new to ML/DL and have been trying to improve my skills by building a model that learns the MNIST data set without TF or Keras. I have 784 input nodes, 2 hidden layers of 16 neurons each, and 10 output nodes corresponding to which digit the model thinks a given picture shows. Sigmoid is the only activation function I used (I know this is sub-optimal). I trained for 200k epochs of pure SGD (batch size of 1 image) and plotted the cost every 10 epochs. My question is this: what is the explanation for this weird behavior of the cost over time?
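Roughly, the setup looks like the sketch below. This is not my actual code; the layer sizes and batch-size-1 SGD are as described above, but the learning rate, initialization, and the random arrays standing in for MNIST are just placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 784 inputs -> 16 -> 16 -> 10 outputs, sigmoid everywhere.
sizes = [784, 16, 16, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((m, 1)) for m in sizes[1:]]

def forward(x):
    """Return the activation of every layer for one column-vector input."""
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts

def sgd_step(x, y, lr=0.5):
    """One plain SGD update on a single example (batch size 1), quadratic cost."""
    acts = forward(x)
    # Output-layer error: (a - y) * sigma'(z), with sigma'(z) = a * (1 - a).
    delta = (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad_W = delta @ acts[l].T
        grad_b = delta
        if l > 0:  # propagate the error to the previous layer before updating
            delta = (weights[l].T @ delta) * acts[l] * (1.0 - acts[l])
        weights[l] -= lr * grad_W
        biases[l] -= lr * grad_b
    return 0.5 * np.sum((acts[-1] - y) ** 2)  # cost of this one example

# Random arrays standing in for the MNIST images and one-hot labels.
X = rng.random((1000, 784, 1))
Y = np.eye(10)[rng.integers(0, 10, 1000)].reshape(1000, 10, 1)

costs = []
for step in range(2000):  # far fewer iterations here than the 200k described above
    i = rng.integers(0, len(X))
    c = sgd_step(X[i], Y[i])
    if step % 10 == 0:
        costs.append(c)  # this is the curve that gets plotted
```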

brubrudsi

1 Answer


No one can be sure of exactly what's happening (especially since you haven't provided any code), but running for 200k epochs with a batch size of 1 immediately stands out as a red flag to me. If you are indeed using a batch size of 1, then the gradient descent updates will be quite noisy and have high variance. 200k passes through all the training data also suggests you are forcing your model to overfit (for reference, a few hundred epochs or fewer is usually sufficient for most results).
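To see the effect of that noise concretely, here is a small illustrative sketch (made-up numbers, not your data): a per-example cost series jumps around its underlying trend, and averaging over a window makes the actual trend visible again.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated per-example costs: a slowly decreasing trend plus the large
# example-to-example noise typical of a batch size of 1.
trend = np.linspace(2.0, 0.3, 20_000)
per_example_cost = trend + rng.normal(0.0, 0.5, size=trend.shape)

def moving_average(values, window=200):
    """Average over a sliding window so the underlying trend becomes visible."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

smoothed = moving_average(per_example_cost)
print(per_example_cost[:5].round(2))  # jumps around wildly
print(smoothed[:5].round(2))          # tracks the trend closely
```

Averaging gradients over a mini-batch of size B has a similar effect on the updates themselves: it reduces the variance of each step by roughly a factor of B, which is why loss curves from mini-batch training look much smoother than what you are seeing.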

information_interchange