When calculating the TD error of the target network in Prioritized Experience Replay, we have, from equation (2) in Appendix B of the paper:
$$\delta_t := R_t + \gamma \max_a Q(S_t, a) - Q(S_{t-1}, A_{t-1})$$
It seems unnecessary / incorrect to me that the same formula should apply when $S_t$ is a terminal state. When computing the error used to update the action-value network, we take special care with terminal states and omit the bootstrap (reward-to-go) term, i.e. the $\gamma \max_a Q(S_t, a)$ above. See here for an example: https://jaromiru.com/2016/10/03/lets-make-a-dqn-implementation/
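For concreteness, here is a minimal NumPy sketch of how a typical DQN implementation (like the one linked above) masks out the bootstrap term for terminal transitions; the function and variable names are my own, not from the paper:

```python
import numpy as np

def td_errors(q_prev, q_next, actions, rewards, dones, gamma=0.99):
    """TD errors for a batch of transitions (S_{t-1}, A_{t-1}, R_t, S_t, done).

    q_prev:  Q-values for S_{t-1}, shape (batch, n_actions)
    q_next:  Q-values for S_t,     shape (batch, n_actions)
    actions: A_{t-1},              shape (batch,)
    rewards: R_t,                  shape (batch,)
    dones:   1.0 if S_t is terminal, else 0.0, shape (batch,)
    """
    # Bootstrap term gamma * max_a Q(S_t, a), zeroed out when S_t is terminal
    bootstrap = gamma * q_next.max(axis=1) * (1.0 - dones)
    target = rewards + bootstrap
    predicted = q_prev[np.arange(len(actions)), actions]
    return target - predicted  # delta_t; |delta_t| would be used as the priority
```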
My question is:
- Should terminal states be handled separately when calculating the TD error for Prioritized Experience Replay?
- Why / why not?