I have a generic network with no random elements in its structure (e.g. no dropout), so if I forward a given image through the network, zero the gradients, and repeat the forward pass with the same input, I get the same result (same gradient vector, same output, …). Now suppose we have a batch of N elements (data, label) and I perform the following experiment:
- forward the whole batch (using `reduction='sum'` in my criterion), call backward to generate the corresponding gradient, and save it in a second object (which we'll refer to as Batch_Grad):
```python
output = model(data)
loss = criterion(output, torch.reshape(label, (-1,)))
loss.backward()
Batch_Grad = []
for p in model.parameters():
    Batch_Grad.append(p.grad.clone())
```
- reset the gradient:
```python
optimizer.zero_grad()
```
- repeat the first point, feeding in the batch's elements one by one, and after each backward collect the corresponding element's gradient (resetting the gradient each time afterwards):
```python
for i in range(0, len(label)):
    # repeat the procedure of point 1. for each data[i] input
    # ...
    optimizer.zero_grad()
```
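Since the body of the loop is elided above, here is a minimal self-contained sketch of what I mean (the `nn.Linear` model, `CrossEntropyLoss` criterion, and random data are placeholders, not my actual setup): each sample is forwarded with a batch dimension of 1, its gradient is added into a running sum, and the gradient is reset before the next sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                           # placeholder model
criterion = nn.CrossEntropyLoss(reduction='sum')  # same reduction as point 1.
data = torch.randn(5, 4)                          # toy batch, N = 5
label = torch.randint(0, 3, (5,))

# running sum of per-sample gradients, one tensor per parameter
Single_Grad = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(label)):
    model.zero_grad()                             # reset before each sample
    loss_i = criterion(model(data[i:i+1]), label[i:i+1])
    loss_i.backward()
    for g, p in zip(Single_Grad, model.parameters()):
        g += p.grad                               # accumulate sample i's gradient
```

Slicing `data[i:i+1]` instead of indexing `data[i]` keeps the batch dimension, so each single-sample input has the same shape convention as the full batch in point 1.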
- sum together the gradient vectors of the previous point, corresponding to each element of the given batch, into a single object (which we'll refer to as Single_Grad)
- compare the objects of points 4. and 1. (Batch_Grad and Single_Grad)
Following the above procedure, I find that the tensors from points 1. and 4. are equal only if the batch size (N) is 1; they differ for N > 1.
With the method of points 3. and 4. I'm manually summing the gradients associated with single-image propagation (which, as noted above, are equal to the ones computed automatically when N=1). Since the automatic batch approach (point 1.) is also expected to perform the same sum: why do I observe this difference?
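For reference, here is a self-contained toy version of the whole experiment (the `nn.Linear` model, `CrossEntropyLoss`, and random data are placeholders for my actual setup). In this toy case Batch_Grad and Single_Grad agree only up to floating-point rounding, not bit-for-bit, so the comparison uses `torch.allclose` rather than exact equality:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                           # placeholder model
criterion = nn.CrossEntropyLoss(reduction='sum')  # sum reduction, as in point 1.
data = torch.randn(5, 4)                          # toy batch, N = 5
label = torch.randint(0, 3, (5,))

# Point 1: gradient of the whole batch in one backward pass
model.zero_grad()
criterion(model(data), label).backward()
Batch_Grad = [p.grad.clone() for p in model.parameters()]

# Points 3-4: per-sample gradients, summed manually
Single_Grad = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(label)):
    model.zero_grad()
    criterion(model(data[i:i+1]), label[i:i+1]).backward()
    for g, p in zip(Single_Grad, model.parameters()):
        g += p.grad

# Point 5: compare up to floating-point tolerance
for bg, sg in zip(Batch_Grad, Single_Grad):
    assert torch.allclose(bg, sg, atol=1e-6)
```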