
I've been experimenting with Gym (and RL) a lot lately, and one specific behaviour of Gym has piqued my interest. Why does OpenAI Gym return a reward of 0 even when the game is over? For example, in Breakout-v0, when all five lives are spent, `env.step` will return `done=True` and `reward=0`. Shouldn't we notify the agent that such a state is unfavourable by returning a negative reinforcement/reward?

Also, for every step in the environment (still Breakout-v0), it returns a reward of 0 if no bricks/blocks were destroyed on that step. So how will an agent be able to differentiate between a normal action and a bad action?
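For concreteness, this is the kind of loop that exhibits the behaviour (a minimal sketch, assuming the classic four-tuple `env.step` API from the Gym versions that shipped Breakout-v0):

```python
import gym

# Play random actions until the episode ends, then inspect the final reward.
env = gym.make("Breakout-v0")
env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
print(reward, done)  # on the life-ending step: 0.0 True, not a negative reward
env.close()
```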

Nilesh PS
    Please read [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask) before attempting to ask more questions. –  Mar 10 '18 at 16:40

1 Answer


Question 1: The reward does not matter once `done == True`. You should reset the environment by calling `env.reset()` when an episode ends.
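In other words, the standard pattern looks roughly like this (a sketch of the usual loop, not code from the question):

```python
import gym

env = gym.make("Breakout-v0")
obs = env.reset()
for _ in range(100000):
    action = env.action_space.sample()  # stand-in for the agent's policy
    obs, reward, done, info = env.step(action)
    # ... feed (obs, action, reward, done) to the agent here ...
    if done:
        # Episode over: the terminal transition has been observed,
        # so just start a fresh episode.
        obs = env.reset()
env.close()
```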

Question 2: The agent does not learn from individual rewards in isolation; it learns from returns, the discounted sums of rewards over the remaining lifetime of the trajectory. A step that itself earns 0 still inherits credit from later rewards through discounting.
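To see how zero-reward steps still get differentiated, consider the return `G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...` (a toy sketch with made-up rewards and an assumed `gamma = 0.99`):

```python
# Hypothetical per-step rewards: most steps earn 0, a brick breaks at t=3.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]
gamma = 0.99  # assumed discount factor

# Compute G_t for every t by sweeping backwards: g = r_t + gamma * g.
returns = []
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print(returns)  # [0.970299, 0.9801, 0.99, 1.0, 0.0]
# Every step leading up to the brick-break gets a distinct, nonzero return.
```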

Alex
  • Even if `done == True`, we still add the experience to the replay buffer, which may well be sampled for training the online network in the future. So how can the reward not matter? – Nilesh PS Mar 10 '18 at 16:44
  • @NileshPS You shouldn't be adding experience to a replay buffer after the environment is done. – Alex Mar 10 '18 at 16:45
  • [DQN algorithm](https://reading-club.github.io/assets/posts/Playing_Atari_with_Deep_Reinforcement_Learning/algorithm.png) If what you said is correct, then `y_j` should be set unconditionally, since `phi_{j+1}` is never a terminal state. – Nilesh PS Mar 10 '18 at 16:48
  • @NileshPS here's an implementation https://github.com/jalexvig/berkeley_deep_rl/blob/master/hw3/dqn.py – Alex Mar 10 '18 at 17:11
  • @NileshPS `phi_{j+1}` can be terminal. But that should be the last tuple of experience added to the buffer. That can still have a nonzero reward `r_{j}`. You should never add a tuple to the buffer where both `phi_j` and `phi_{j+1}` are terminal (see the target-computation sketch after these comments). – Dennis Soemers Mar 10 '18 at 17:12
  • @Alex I took a look and it saves experience to the buffer even if `done=True`. Lines 266 - 270 – Nilesh PS Mar 10 '18 at 17:14
  • @NileshPS that's only one timestep since the environment is reset. – Alex Mar 10 '18 at 17:17
  • @Alex Got it. Thanks :-) I am still not convinced about the second part of my question though. – Nilesh PS Mar 10 '18 at 17:19
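For reference, here is a sketch of the target computation the comments are debating (the `y_j` line from the linked DQN algorithm; the function name, array shapes, and values are assumptions for illustration, not from any particular implementation):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute y_j: r_j for terminal transitions, else r_j + gamma * max_a Q(phi_{j+1}, a).

    rewards:       shape (batch,)            -- r_j from the replay buffer
    next_q_values: shape (batch, n_actions)  -- target-network Q(phi_{j+1}, .)
    dones:         shape (batch,)            -- 1.0 where phi_{j+1} is terminal
    """
    bootstrap = next_q_values.max(axis=1)
    # (1 - dones) zeroes the bootstrap term for terminal transitions, which is
    # why the terminal tuple still belongs in the buffer: its reward r_j is
    # used even though phi_{j+1} is terminal.
    return rewards + gamma * (1.0 - dones) * bootstrap

# Example: the terminal transition (done=1.0) contributes only its reward.
y = dqn_targets(np.array([0.0, 1.0]),
                np.array([[0.5, 0.2], [0.3, 0.9]]),
                np.array([0.0, 1.0]))
print(y)  # [0.0 + 0.99*0.5, 1.0] = [0.495, 1.0]
```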