I have a toy reinforcement learning project based on the REINFORCE algorithm (here's PyTorch's implementation) that I would like to add batch updates to. In RL, the "target" can only be created after a "prediction" has been made, so standard batching techniques do not apply. As such, I accrue losses for each episode and append them to a list `l_losses`, where each item is a zero-dimensional tensor. I hold off on calling `.backward()` or `optimizer.step()` until a certain number of episodes have passed, in order to create a sort of pseudo-batch.
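Roughly, the loop looks like this (a self-contained sketch, not my actual code: the toy policy, the fabricated per-episode loss, and `BATCH_SIZE` are all placeholders for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-ins for my actual policy/optimizer, just for illustration
policy = nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

BATCH_SIZE = 10  # episodes per pseudo-batch (placeholder value)
l_losses = []    # list of zero-dimensional loss tensors

for episode in range(100):
    # Stand-in for running one episode and computing its REINFORCE loss,
    # i.e. -sum(log_prob * return) over the episode's steps
    obs = torch.randn(4)
    log_probs = torch.log_softmax(policy(obs), dim=-1)
    episode_return = torch.randn(())        # fabricated return
    loss = -log_probs[0] * episode_return   # zero-dimensional tensor
    l_losses.append(loss)

    if len(l_losses) == BATCH_SIZE:
        # Deferred pseudo-batch update goes here (see the variants below)
        l_losses = []
```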
Given this list of losses, how do I have PyTorch update the network based on their average gradient? Or would updating based on the average gradient be the same as updating on the average loss (I seem to have read otherwise elsewhere)?
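In symbols: with $L_i$ the accrued loss for episode $i$, $N$ the pseudo-batch size, and $\theta$ the network parameters, I'm asking whether stepping with the average gradient,

$$\frac{1}{N}\sum_{i=1}^{N} \nabla_\theta L_i,$$

gives the same update as stepping with the gradient of the average loss,

$$\nabla_\theta\left(\frac{1}{N}\sum_{i=1}^{N} L_i\right).$$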
My current method is to create a new tensor `t_loss` from `torch.stack(l_losses)`, then run `t_loss = t_loss.mean()`, `t_loss.backward()`, and `optimizer.step()`, and finally zero the gradients. But I'm unsure whether this is equivalent to what I intend. It's also unclear to me whether I should instead have been running `.backward()` on each individual loss rather than collecting them in a list, while still holding off on the `optimizer.step()` call until the end.
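Concretely, the two candidates I'm comparing look like this (continuing the sketch above, these would replace the comment inside the `if` block; only one variant would run per batch, since `.backward()` frees each episode's graph):

```python
# Variant A (my current method): stack, average, one backward pass
t_loss = torch.stack(l_losses).mean()
t_loss.backward()
optimizer.step()
optimizer.zero_grad()

# Variant B (the alternative I'm unsure about): call backward per loss so
# the gradients accumulate in each parameter's .grad, then step once at
# the end. NOTE: dividing by len(l_losses) is my own guess at the right
# normalization to match the mean in Variant A; whether that is correct
# is part of what I'm asking.
for loss in l_losses:
    (loss / len(l_losses)).backward()
optimizer.step()
optimizer.zero_grad()
```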