I'm having some difficulty understanding how to apply backpropagation through time (BPTT) to the A2C (Advantage Actor-Critic) method, or to any reinforcement learning method for that matter.
As I understand it, BPTT conceptually unrolls a recurrent network over several time steps, performs a forward pass through the unrolled network, computes a loss from the outputs, and then backpropagates that loss through the unrolled steps so the gradient accounts for the network's previous hidden states. However, I'm unsure how to combine this with A2C. Should I compute the actor and critic losses only at the end of an epoch and backpropagate those, or should I accumulate the per-step losses over the rollout and backpropagate their sum? Or have I misunderstood entirely and need to do something else?
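To make the question concrete, here is a rough sketch of the second option as I imagine it: an LSTM-based actor-critic where the hidden state is carried across a short rollout, per-step losses are accumulated, and a single backward pass at the end unrolls BPTT over the whole rollout. This is written with PyTorch; all the names (`RecurrentA2C`, the dummy observations and returns) are illustrative, not from any library, and the returns/advantages would of course come from real rewards in practice:

```python
import torch
import torch.nn as nn

# Illustrative minimal recurrent actor-critic; names and dimensions are made up.
class RecurrentA2C(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.lstm = nn.LSTMCell(obs_dim, hidden)
        self.actor = nn.Linear(hidden, n_actions)   # policy logits head
        self.critic = nn.Linear(hidden, 1)          # state-value head

    def forward(self, obs, state):
        h, c = self.lstm(obs, state)                # state=None -> zero initial state
        return self.actor(h), self.critic(h), (h, c)

model = RecurrentA2C(obs_dim=4, n_actions=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

state = None          # LSTM hidden state carried across the rollout (not detached)
losses = []
obs = torch.zeros(1, 4)                             # dummy observation stream
for t in range(5):                                  # 5-step rollout
    logits, value, state = model(obs, state)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    ret = torch.tensor([[1.0]])                     # dummy return; normally from rewards
    advantage = (ret - value).detach()              # advantage treated as a constant
    policy_loss = -dist.log_prob(action) * advantage
    value_loss = (ret - value).pow(2)
    losses.append(policy_loss + 0.5 * value_loss)   # accumulate per-step losses
    obs = torch.zeros(1, 4)                         # next obs would come from the env

opt.zero_grad()
total_loss = torch.stack(losses).sum()
total_loss.backward()   # one backward pass: BPTT through all 5 steps' hidden states
opt.step()
```

Is something along these lines the right idea, or should the backward pass be structured differently?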
Thanks in advance for any advice.