I am adapting neural-network training code that currently does online (per-sample) training so that it works with mini-batches. Is the mini-batch gradient for a weight (dE/dw) just the sum of the gradients of the individual samples in the mini-batch? Or is it some non-linear combination of them because of the sigmoid output functions? Or is it the sum divided by some number (e.g. the batch size) to make the step smaller?
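To make concrete what I mean by "sum" versus "sum divided by some number", here is a minimal sketch of the two candidate rules I have in mind (`grad` here is a hypothetical per-sample gradient function, not taken from my code or any library):

```python
import numpy as np

def minibatch_gradient(grad, w, xs, ts, average=False):
    """Combine per-sample gradients dE/dw over a mini-batch.

    grad(w, x, t) is assumed to return the gradient for one sample (x, t).
    average=False: plain sum of the per-sample gradients.
    average=True:  the sum divided by the batch size (the mean).
    """
    g = np.zeros_like(w)
    for x, t in zip(xs, ts):
        g += grad(w, x, t)          # accumulate each sample's dE/dw
    return g / len(xs) if average else g
```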
Clarification: it is better to pose this more specifically, as a question about the relationship between the full-batch gradient and the per-sample (online) gradients. See the next paragraph:
I am using neurons with a sigmoid activation function to classify points in a 2-d space. The architecture is 2 x 10 x 10 x 1. There are 2 output classes: some points are labelled 1 and the others 0. The error for a single sample is half the square of (target - output). My question is: is the full-batch gradient equal to the sum of the per-sample gradients, computed with the weights held constant across the batch?
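To make the question concrete, here is a minimal sketch of the numerical check I have in mind, assuming the 2 x 10 x 10 x 1 sigmoid network and per-sample error 0.5 * (target - output)**2 described above. It sums the per-sample backprop gradients at fixed weights and compares one weight entry against a finite-difference gradient of the summed batch error. The helper names (forward, sample_gradient, batch_error) are my own for this post, not from my actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """Forward pass; returns the activations of every layer, input included."""
    acts = [x]
    for W, b in weights:
        x = sigmoid(W @ x + b)
        acts.append(x)
    return acts

def sample_gradient(weights, x, t):
    """Backprop for one sample with E = 0.5 * (t - y)**2."""
    acts = forward(weights, x)
    y = acts[-1]
    delta = (y - t) * y * (1.0 - y)             # dE/dz at the output layer
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        W, _ = weights[l]
        a_in = acts[l]                          # activation feeding layer l
        grads[l] = (np.outer(delta, a_in), delta.copy())
        if l > 0:
            delta = (W.T @ delta) * a_in * (1.0 - a_in)
    return grads

def batch_error(weights, xs, ts):
    """Batch error defined as the sum of the per-sample errors."""
    return sum(0.5 * (t - forward(weights, x)[-1][0]) ** 2
               for x, t in zip(xs, ts))

rng = np.random.default_rng(0)
sizes = [2, 10, 10, 1]
weights = [(0.5 * rng.standard_normal((m, n)), 0.5 * rng.standard_normal(m))
           for n, m in zip(sizes[:-1], sizes[1:])]
xs = rng.standard_normal((8, 2))
ts = rng.integers(0, 2, size=8).astype(float)

# Sum of the per-sample gradients, weights held fixed across the whole batch.
summed = [(np.zeros_like(W), np.zeros_like(b)) for W, b in weights]
for x, t in zip(xs, ts):
    for (sW, sb), (gW, gb) in zip(summed, sample_gradient(weights, x, t)):
        sW += gW
        sb += gb

# Finite-difference gradient of the batch error for a single weight entry.
eps = 1e-6
weights[0][0][0, 0] += eps
e_plus = batch_error(weights, xs, ts)
weights[0][0][0, 0] -= 2 * eps
e_minus = batch_error(weights, xs, ts)
weights[0][0][0, 0] += eps

print("finite difference         :", (e_plus - e_minus) / (2 * eps))
print("summed per-sample gradient:", summed[0][0][0, 0])
```

If the two printed numbers agree (up to finite-difference error), that would confirm the relationship I am asking about for this particular setup.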