I'm using PyTorch to implement a Q-Learning approach to a card game, where the reward comes only at the end of the hand when a score is calculated. I am using experience replay with a high gamma (0.5-0.95) to train the network.
My question is about how to apply the discounted rewards to the replay memory. It seems that the correct discounted reward depends on knowing, at some point, the temporal sequence of state transitions and rewards, and applying the discount recursively, working backward from the terminal state.
Yet most algorithms seem to apply the gamma to a randomly-selected batch of transitions from the replay memory, which would seem to de-correlate them temporally and make calculating discounted rewards problematic. In those algorithms the discount appears to be applied to the output of a forward pass on the `next_state`, which I find hard to interpret.
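For concreteness, here is roughly what I understand the standard update to look like (a minimal sketch with placeholder names like `policy_net`, `target_net`, and `memory`; not code from any particular tutorial):

```python
import random
import torch
import torch.nn as nn

def dqn_update(policy_net, target_net, memory, optimizer, batch_size=64, gamma=0.95):
    # Sample a temporally de-correlated minibatch of (s, a, r, s', done) tuples.
    batch = random.sample(memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    states      = torch.stack(states)
    actions     = torch.tensor(actions).unsqueeze(1)
    rewards     = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.stack(next_states)
    dones       = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions actually taken.
    q_sa = policy_net(states).gather(1, actions).squeeze(1)

    # One-step bootstrapped target: r + gamma * max_a' Q(s', a'),
    # with the bootstrap term zeroed on terminal transitions.
    with torch.no_grad():
        max_q_next = target_net(next_states).max(1).values
    targets = rewards + gamma * max_q_next * (1.0 - dones)

    loss = nn.functional.smooth_l1_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

As I read it, gamma here only discounts the network's own estimate of the next state's value, not an explicit return computed over the episode.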
My approach has been to calculate the discounted rewards once the terminal state is reached, and to write them directly into the replay memory's reward values at that time. I do not apply the gamma again at replay time, since it has already been factored in.
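Here is a sketch of what I mean (simplified, with made-up names; the real code stores transitions per hand in temporal order):

```python
def apply_discounted_returns(episode_transitions, gamma=0.95):
    """episode_transitions: list of [state, action, reward, next_state, done]
    for a single hand, in temporal order; only the final reward is non-zero."""
    running_return = 0.0
    for transition in reversed(episode_transitions):
        # G_t = r_t + gamma * G_{t+1}, accumulated backward from the terminal state.
        running_return = transition[2] + gamma * running_return
        transition[2] = running_return  # overwrite r_t with the return G_t
    return episode_transitions

# These pre-discounted transitions are then pushed into the replay memory,
# and gamma is NOT applied again when sampling random batches.
```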
This makes sense to me, but it is not what I see, for example, in the PyTorch "Reinforcement Learning (DQN) Tutorial". Can someone explain how the temporal de-correlation of random batches is reconciled with discounting in high-gamma Q-Learning?