
I am training MobileNetV3 with TFRecords produced by the TensorFlow Models scripts. The training loss w.r.t. steps is plotted below. One unit on the x-axis is 20k steps (approximately 2 epochs, given a batch size of 128 and 1,281,167 samples in total).

I decay the learning rate exponentially from 0.01 every 3 epochs with staircase mode, and the loss falls normally in the first 4 epochs. After the 4th epoch, however, the loss rises and falls every epoch. I have tried the momentum optimizer (orange curve) and the RMSProp optimizer (blue curve) and get similar results. Please help me troubleshoot this problem.
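For reference, a minimal sketch of such a schedule in modern TF (my actual code is TF 1.x era); the initial rate of 0.01 and the "every 3 epochs, staircase" cadence are as described above, while the decay factor (0.94 here) is a placeholder, since I haven't stated it:

```python
import tensorflow as tf

num_samples = 1281167
batch_size = 128
steps_per_epoch = num_samples // batch_size      # ~10,009 steps per epoch

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=3 * steps_per_epoch,             # decay once every 3 epochs
    decay_rate=0.94,                             # placeholder decay factor
    staircase=True)                              # discrete drops, not a smooth curve

optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```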


1 Answer


The periodicity is almost certainly aligned to 1 full epoch.

It's natural for your model to show some random variation in loss across different batches. As the weights stabilise, you see that same variation repeated over and over, so you get (roughly) the same loss for each batch every epoch.

I'm not sure it needs troubleshooting, but if you really want to avoid it, you could shuffle your dataset between epochs.
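A minimal sketch of an input pipeline that re-shuffles every epoch (the file pattern and feature keys follow the usual TensorFlow Models ImageNet TFRecord layout, but they are assumptions about your data):

```python
import tensorflow as tf

feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_and_preprocess(example_proto):
    parsed = tf.io.parse_single_example(example_proto, feature_spec)
    image = tf.io.decode_jpeg(parsed["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["image/class/label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("train-*-of-*"))
    .map(parse_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    # Shuffle *before* repeat with reshuffle_each_iteration=True (the
    # default), so each epoch sees the examples in a fresh order.
    .shuffle(10_000, reshuffle_each_iteration=True)
    .repeat()
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)
```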

Stewart_R
  • My order for building the tf.data dataset is: preprocess -> shuffle(1000) -> repeat -> batch(128). Is this order correct? Is the shuffle buffer too small for ImageNet? – Euphoria Yang Aug 06 '19 at 03:29
  • Not sure there is a correct and incorrect here. If you switched `repeat` and `shuffle` around, it would shuffle each epoch. This would remove the repeating pattern, but I'm not expecting any real improvement in performance. The reason the pattern is noticeable is that your model is more or less optimised as far as it can go and just (naturally and expectedly) performs better on some batches than on others (see the sketch after these comments). – Stewart_R Aug 06 '19 at 09:05
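A toy sketch contrasting the two orderings discussed in the comments, using a small range dataset in place of the real TFRecord pipeline:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Question's order: shuffle -> repeat. With the default
# reshuffle_each_iteration=True, the order differs between epochs, but
# shuffling happens only within the buffer window (1000 in the question,
# which is small relative to ImageNet's 1,281,167 examples).
shuffle_then_repeat = ds.shuffle(5, reshuffle_each_iteration=True).repeat(2)

# Swapped order: repeat -> shuffle. Shuffling now crosses epoch
# boundaries, so a single pass may see some examples twice before
# seeing others at all.
repeat_then_shuffle = ds.repeat(2).shuffle(5)

print([int(x) for x in shuffle_then_repeat])
print([int(x) for x in repeat_then_shuffle])
```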