
I'm using PyTorch to implement a Q-Learning approach to a card game, where the rewards come only at the end of the hand when a score is calculated. I am using experience replay with high gammas (0.5-0.95) to train the network.

My question is about how to apply the discounted rewards to the replay memory. It seems that the correct discounted reward depends on knowing, at some point, the temporal sequence of state transitions and rewards, and applying the discount recursively backward from the terminal state.

Yet most algorithms seem to apply the gamma somehow to a randomly-selected batch of transitions from the replay memory, which would seem to de-correlate them temporally and make calculation of discounted rewards problematic. The discount in these algorithms seems to be applied to a forward pass on the "next_state", although I find this hard to interpret.

My approach has been to calculate the discounted rewards when the terminal state has been reached, and apply them directly to the replay memory's reward values at that time. I do not reference the gamma at replay time, since it has already been factored in.
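Roughly, what I do at the end of each hand looks like this (just a sketch; `push_episode`, `episode_transitions`, and `replay_memory` are placeholder names, not my actual code):

    from collections import deque

    gamma = 0.9                          # same high gamma used for training
    replay_memory = deque(maxlen=10_000)

    def push_episode(episode_transitions):
        """episode_transitions: the hand's (state, action, reward, next_state)
        tuples in temporal order; the reward is non-zero only on the last one."""
        G = 0.0
        discounted = []
        # Walk backward from the terminal transition so each step gets its return G_t.
        for state, action, reward, next_state in reversed(episode_transitions):
            G = reward + gamma * G
            discounted.append((state, action, G, next_state))
        # Store in the original temporal order; gamma is already folded into G,
        # so it is not applied again when batches are sampled.
        replay_memory.extend(reversed(discounted))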

This makes sense to me, but it is not what I see, for example, in the PyTorch "Reinforcement Learning (DQN) Tutorial". Can someone explain how the time-decorrelation in random batches is handled for high-gamma Q-Learning?


1 Answer


Imagine you are playing a simple game where you move around a grid and collect coins. You're facing a common challenge in reinforcement learning: rewards come late, and it's hard to know which actions were good or bad. In Q-Learning, you want to know how good it is to take a certain move (action) at a certain spot (state) on the grid. We call this the Q-value, and you calculate it with this formula:

    Q(state, action) = reward + gamma * max_a' Q(next_state, a')

The Q-value is the immediate reward plus the (discounted) best Q-value you can get from the next state. You save each move (state, action, reward, next_state) in a memory. During training, you randomly pick some of these moves to update the Q-values, which helps avoid focusing too much on the most recent moves.

Although the moves are picked randomly, the sequence of rewards is still taken into account: each stored move carries its next state, and the network's estimate for that next state is what predicts the future rewards. This is the gamma * max_a' Q(next_state, a') part of the formula.

Your approach of waiting until the end of the game to calculate the returns is a bit different. It's closer to the Monte Carlo method, where you only update the values at the end of each game: you play the whole game first, then decide how good the moves were. This can work, but it may be less effective when the games are long. Traditional Q-Learning, on the other hand, updates the Q-values as you play.
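To make this concrete, here is a minimal sketch of that batch update, assuming a replay buffer of (state, action, reward, next_state, done) tuples and a separate policy/target network pair, broadly in the spirit of the standard DQN setup. The names (`dqn_update`, `policy_net`, `target_net`) are illustrative, not a reference to your code:

    import random
    import torch
    import torch.nn.functional as F

    def dqn_update(policy_net, target_net, optimizer, replay_memory,
                   batch_size=64, gamma=0.95):
        # Random sampling: the batch is temporally uncorrelated on purpose.
        batch = random.sample(replay_memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states      = torch.stack(states)
        actions     = torch.tensor(actions).unsqueeze(1)
        rewards     = torch.tensor(rewards, dtype=torch.float32)
        next_states = torch.stack(next_states)
        dones       = torch.tensor(dones, dtype=torch.float32)

        # Q(s, a) for the actions that were actually taken.
        q_sa = policy_net(states).gather(1, actions).squeeze(1)

        # Gamma is applied here, to a forward pass on next_state,
        # not to a stored discounted return.
        with torch.no_grad():
            max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q

        loss = F.smooth_l1_loss(q_sa, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Notice that gamma multiplies the network's own estimate for next_state, never a reward stored earlier in the episode, and that for terminal transitions the (1 - done) mask drops the bootstrap term so the target is just the final reward.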

Keep in mind that in standard Q-Learning you don't need to calculate and store discounted rewards manually. The discounting of future rewards happens inside the Q-value update itself, so even when you train on random batches of transitions, future rewards are still taken into account through that update. This is how Q-Learning copes with the time-decorrelation of random batches, even with a high gamma.

  • Thanks so much for this, I_Al-thamary. Again, I'm not sure I understand how this works in the case of a reward that arrives many steps in the future. How is gamma * max_a' Q(next_state, a') calculated? As I understand it, the "action" saved in replay memory is actually the output of the DQN, to which something like argmax() is applied to determine the actual "action", which is then rewarded (typically once the new state is reached). If the instant reward is zero (usually true in my case), the discounted reward is zero as well. What am I not understanding? – black-ejs May 21 '23 at 13:59
  • Q-learning handles delayed rewards by updating Q-values iteratively. Even if the immediate reward is zero, the max_a' Q(s', a') term, representing the best future reward, might not be zero if future rewards are expected. The agent learns which actions lead to these rewards and updates the Q-values accordingly, starting from actions closer to the reward and gradually working back to earlier ones. This "backpropagation of value" incorporates expectations of future rewards into earlier actions' Q-values, even if the immediate reward is zero. If rewards are delayed, though, Q-learning might learn slowly. – I_Al-thamary May 21 '23 at 14:11
  • Thanks again. Sorry for my ongoing confusion. When you say "the agent...updates the Q-values accordingly, starting from actions closer to the reward and gradually working back to earlier ones", if the batch was randomly selected from the replay memory, and the updated Q-values are not stored, how are the Q-values updated? The discounted reward they received - eventually - during online learning might not appear in the batch at all, and might be before or after it. – black-ejs May 21 '23 at 17:44
  • When I mentioned "starting from actions closer to the reward and gradually working back to earlier ones," I was simplifying it. In Q-Learning, Q-values, stored in a table or approximated by a neural network, are updated iteratively using experiences randomly drawn from the replay memory. Each experience consists of a state, action, reward, and the next state. – I_Al-thamary May 21 '23 at 18:02
  • For each experience, the Q-value of the action taken is updated towards the sum of the immediate reward and the discounted future reward (estimated using the current Q-values and the max_a' Q(next_state, a') term). Even if the immediate reward is zero, this update process still accounts for future rewards. The iterative process doesn't require the complete sequence of experiences leading to a reward in the same batch. Q-value estimates gradually improve, reflecting both immediate and future rewards, as different experiences are sampled over time (a small numerical sketch of this appears after these comments). – I_Al-thamary May 21 '23 at 18:03
  • Thank you once again. It seems I have more learning to do. Since I have "a state, action, reward, and the next state", how do I determine `max_a' Q(next_state, a')`? By "processing" next_state and observing... the reward? (For me this will be zero most times.) The value of the argmax() output unit? Is this a better measure than the (often non-zero) gamma-discounted actual experience? I see that a stored value might be stale regarding the action taken, but if the replay buffer is recent it would seem a rich expression of the Q-value. Very much appreciate the help. – black-ejs May 21 '23 at 19:14
  • max_a' Q(s', a') is used to calculate the Q-value update, which is the target for training your Q-network. The Q-values themselves are the measures of action quality, with higher Q-values indicating better actions. So, it's not based on the immediate reward after performing the action, but the long-term reward. This is crucial in problems where rewards are delayed. Your replay buffer indeed provides a rich set of experiences for the agent to learn from. The Q-values are not static but they are continually updated as the agent gains more experience, making the learning process more robust. – I_Al-thamary May 21 '23 at 19:38
  • See these : [handle delayed reward in reinforcement learning](https://www.reddit.com/r/reinforcementlearning/comments/tt19mu/how_to_deal_with_delayed_dense_rewards/) and [An example task for delayed rewards](https://ml-jku.github.io/rudder/) – I_Al-thamary May 21 '23 at 19:47
  • Too new here to reward you correctly, so I can only use words. I think I understand a bit more. You have been generous with your time and expertise. Thanks much. – black-ejs May 21 '23 at 21:56
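To tie the comment thread together, here is a tiny, self-contained tabular illustration (a made-up 4-state chain with a single action and a reward only on the final transition; all numbers are toy values) showing that the delayed reward still propagates backward to earlier states even though transitions are sampled in random order:

    import random

    # Toy chain: states 0 -> 1 -> 2 -> 3 (terminal), single action.
    # Reward is 0 everywhere except the final transition into state 3.
    transitions = [
        (0, 0.0, 1, False),
        (1, 0.0, 2, False),
        (2, 1.0, 3, True),
    ]
    gamma, alpha = 0.9, 0.5
    Q = [0.0, 0.0, 0.0, 0.0]        # one Q-value per state (single action)

    for _ in range(200):
        s, r, s_next, done = random.choice(transitions)   # random, uncorrelated sampling
        bootstrap = 0.0 if done else Q[s_next]             # max_a' Q(s', a') with one action
        Q[s] += alpha * (r + gamma * bootstrap - Q[s])

    print(Q)   # roughly [0.81, 0.9, 1.0, 0.0]

After enough random samples, Q[2] approaches 1.0, Q[1] approaches gamma * 1.0 = 0.9, and Q[0] approaches gamma^2 = 0.81, even though no stored reward was ever discounted by hand and no single update saw the full temporal sequence.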