I'm trying to implement mini-batching correctly for my own NN.
But I can't wrap my head around what exactly gets summed. Do I sum the gradients, or the delta weights (where the learning rate is already multiplied in)? The delta weight and delta bias in my example are:
Delta Weight: (activation'(neurons) ⊗ Error) * learningRate × input
Delta Bias: (activation'(neurons) ⊗ Error) * learningRate
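
To make that concrete, here is a rough sketch of how I compute those per-example deltas for a single layer (NumPy, sigmoid activation assumed only for illustration; `per_example_deltas` is just my own helper name):

```python
import numpy as np

# Rough sketch of my per-example backward pass for one layer,
# with a sigmoid activation assumed just for illustration.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def per_example_deltas(z, error, x, learning_rate):
    """z: pre-activations of the layer, error: backpropagated error for the layer,
    x: input vector to the layer. Returns (delta_weight, delta_bias)."""
    grad_bias = sigmoid_prime(z) * error      # activation'(neurons) ⊗ Error
    delta_bias = learning_rate * grad_bias    # ... * learningRate
    delta_weight = np.outer(delta_bias, x)    # ... × input (outer product)
    return delta_weight, delta_bias
```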
Do I also divide those summed delta weights or gradients by the batch size?
EDIT:
So, all my questions summed up:
- Is the delta weight without the learning rate called the gradient?
- Do I need to add up those delta weights with or without the learning rate multiplied in? (That's the difference between the two variants sketched below.)
- So I must keep two separate gradients, one for the weights and one for the biases?
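
And here are the two mini-batch update variants I can't decide between, as a rough sketch building on the helper above (dummy data just to make it run; the sign of the update depends on how Error is defined):

```python
rng = np.random.default_rng(0)
n_in, n_out, batch_size = 4, 3, 8
learning_rate = 0.1
weights = rng.normal(size=(n_out, n_in))
biases = np.zeros(n_out)

# dummy batch of (input, pre-activation, backpropagated error) triples
batch = [(rng.normal(size=n_in), rng.normal(size=n_out), rng.normal(size=n_out))
         for _ in range(batch_size)]

# Variant A: accumulate the delta weights with the learning rate already multiplied in
delta_w_sum = np.zeros_like(weights)
delta_b_sum = np.zeros_like(biases)
for x, z, error in batch:
    dw, db = per_example_deltas(z, error, x, learning_rate)
    delta_w_sum += dw
    delta_b_sum += db
weights += delta_w_sum / len(batch)   # <- divide by the batch size here, or not?
biases += delta_b_sum / len(batch)

# Variant B: accumulate the raw gradients, average, then apply the learning rate once
grad_w_sum = np.zeros_like(weights)
grad_b_sum = np.zeros_like(biases)
for x, z, error in batch:
    gb = sigmoid_prime(z) * error     # gradient for the bias
    gw = np.outer(gb, x)              # gradient for the weights
    grad_w_sum += gw
    grad_b_sum += gb
weights += learning_rate * grad_w_sum / len(batch)
biases += learning_rate * grad_b_sum / len(batch)
```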