
If I understand correctly, when training a deep learning model with mini-batches, every mini-batch gets a forward pass, a backward pass, and an update step from the optimizer. But does something different happen at the end of an epoch (after all mini-batches have been used)?

The reason I'm asking is that in my U-Net implementation for image segmentation, the loss decreases slightly with every mini-batch (on the order of 0.01). However, when a new epoch starts, the loss of its first mini-batch differs from that of the last mini-batch of the previous epoch by a much larger amount (on the order of 0.5). Also, after the first epoch, the loss on the test data is on the order of the loss reported for the first mini-batch of the next epoch.

I would interpret this as the weights being updated more strongly at the end of an epoch than between consecutive mini-batches, but I have found no theory supporting this. I would appreciate an explanation.

This happens with both stochastic gradient descent and Adam as the optimizer. If it helps, I am using Keras.
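
For concreteness, here is a minimal sketch of the training loop I have in mind (the tiny model and random data below are placeholders, not my actual U-Net); as far as I understand it, a new epoch only reshuffles the data, with no extra update step at the epoch boundary:

```python
import numpy as np
from tensorflow import keras

# Placeholder model and data, just to illustrate the loop structure.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="sgd", loss="mse")

x = np.random.rand(100, 10).astype("float32")
y = np.random.rand(100, 1).astype("float32")

batch_size, n_epochs = 30, 2
for epoch in range(n_epochs):
    # A new epoch only reshuffles the data; no extra weight update happens here.
    order = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # the last batch may be smaller (100 % 30 = 10)
        # One forward pass, one backward pass, one optimizer step per mini-batch.
        loss = model.train_on_batch(x[idx], y[idx])
```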

alvgom
  • It seems that your question is focused on machine learning and is not directly related to programming, making it off-topic here. You may find that [Cross Validated](https://stats.stackexchange.com) or [Data Science SE](https://datascience.stackexchange.com) are a better fit for these questions. – E_net4 Oct 12 '17 at 10:19
  • While I'm not able to follow your description completely, I'm quite positive that you are misinterpreting Keras' output. Make sure you understand how the verbose output is calculated: it is a **mean** (without checking, something like: the mean after 1 mini-batch of this epoch, then the mean over 2 mini-batches, and so on; later iterations will surely look more stable, since the mean then changes less with each batch); see the sketch below these comments. – sascha Oct 12 '17 at 10:31
  • It's hard to say without looking at your code (how the data is split into batches, whether it is reshuffled, what exactly is reported, etc.), but all batches have more or less the same effect on the network. The only possible difference is the size of the last batch, e.g. if the training set size is `100` and the batch size is `30`, then the last mini-batch has only `10` samples (again, this depends on the actual implementation). – Maxim Oct 12 '17 at 11:00
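
To illustrate sascha's point, here is a small sketch with made-up per-batch loss values, showing how a per-epoch running average (which, per that comment, is what Keras' verbose output reports) can produce an apparent jump at an epoch boundary even when the underlying per-batch losses change smoothly:

```python
# Made-up per-batch losses, just to illustrate the effect of a per-epoch running mean.
epoch1_batch_losses = [2.0, 1.5, 1.2, 1.0, 0.9, 0.85]
epoch2_batch_losses = [0.84, 0.82, 0.80, 0.79, 0.78, 0.77]

def running_means(batch_losses):
    """Running average of the losses seen so far within one epoch."""
    means, total = [], 0.0
    for i, loss in enumerate(batch_losses, start=1):
        total += loss
        means.append(total / i)
    return means

print(running_means(epoch1_batch_losses)[-1])  # ~1.24: value displayed at the end of epoch 1
print(running_means(epoch2_batch_losses)[0])   # 0.84: value displayed for the first batch of epoch 2
# The displayed number drops from ~1.24 to 0.84 across the epoch boundary, even though
# consecutive per-batch losses differ far less; the running average is simply reset.
```

If this is what is happening, it would also be consistent with the test loss after the first epoch being close to the first value displayed in the second epoch: both reflect the weights at the end of epoch one, without the higher early-epoch batches pulled into the running average.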

0 Answers