In a neural network multilayer perceptron (MLP), I understand that the main difference between Stochastic Gradient Descent (SGD) and Gradient Descent (GD) lies in how many samples are used per training step. That is, SGD iteratively picks one sample, performs a forward pass, and then backpropagates to adjust the weights, as opposed to GD, where backpropagation starts only after the forward pass has been computed over all samples.
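To make sure I have the distinction right, here is a minimal toy sketch of what I mean, assuming a single linear output neuron with squared error (the data, learning rate, and averaging in the batch gradient are just my assumptions, not anything from a library):

```python
import numpy as np

# Toy data: 8 samples, 3 features, one output neuron (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)
lr = 0.01

# SGD-style: forward pass and weight update for one sample at a time.
for xi, yi in zip(X, y):
    err = xi @ w - yi       # forward pass for a single sample
    w -= lr * err * xi      # update the weights immediately

# (Full-batch) GD-style: forward pass over all samples, then one update.
err = X @ w - y             # forward pass for every sample
grad = X.T @ err / len(X)   # here I averaged the per-sample errors -- which is exactly my question
w -= lr * grad
```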
My questions are:
- When Gradient Descent (or mini-batch Gradient Descent) is the chosen approach, how do we represent the error from a single forward pass? Assuming my network has only a single output neuron, is the error represented by averaging all the individual per-sample errors, or by summing them? (See the toy snippet after this list for what I mean by summing vs. averaging.)
- In scikit-learn's MLPClassifier, does anyone know how this error is accumulated: by averaging or by summing?
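
To pin down what I mean by "summing" vs. "averaging", here is a toy snippet with made-up per-sample error values (not taken from any library):

```python
import numpy as np

# Hypothetical per-sample errors from one full-batch forward pass.
errors = np.array([0.2, -0.5, 0.1, 0.4])

loss_sum = np.sum(errors ** 2)    # option 1: summed squared error
loss_mean = np.mean(errors ** 2)  # option 2: averaged squared error (sum / n_samples)

# The two differ only by the constant factor n_samples, which effectively rescales
# the learning rate, but I'd like to know which convention is actually used.
```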
Thank you very much.