
I've been experimenting with Gym (and RL) a lot lately, and one specific behaviour of Gym has piqued my interest. Why does OpenAI Gym return a reward of 0 even when the game is over? For example, in Breakout-v0, when all five lives are spent, `env.step` will return `done=True` and `reward=0`. Shouldn't we notify the agent that such a state is unfavourable by returning a negative reinforcement/reward?

Also, for every step in the environment (still Breakout-v0), it returns a reward of 0 if no bricks/blocks were destroyed on that step. So how will an agent be able to differentiate between a normal action and a bad action?
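For concreteness, this is the kind of loop that exhibits the behaviour (a minimal sketch, assuming the classic four-tuple `env.step` API from the Gym versions that shipped Breakout-v0):

```python
import gym

# Play random actions until the episode ends, then inspect the final reward.
env = gym.make("Breakout-v0")
env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
print(reward, done)  # on the life-ending step: 0.0 True, not a negative reward
env.close()
```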

Nilesh PS
    Please read [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask) before attempting to ask more questions. –  Mar 10 '18 at 16:40

1 Answer


Question 1: The reward does not matter once `done == True`. You should reset the environment by calling `env.reset()` when an episode ends.
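In other words, the standard pattern looks roughly like this (a sketch of the usual loop, not code from the question):

```python
import gym

env = gym.make("Breakout-v0")
obs = env.reset()
for _ in range(100000):
    action = env.action_space.sample()  # stand-in for the agent's policy
    obs, reward, done, info = env.step(action)
    # ... feed (obs, action, reward, done) to the agent here ...
    if done:
        # Episode over: the terminal transition has been observed,
        # so just start a fresh episode.
        obs = env.reset()
env.close()
```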

Question 2: The agent does not learn from individual rewards in isolation; it learns from returns, the discounted sums of rewards over the remaining lifetime of the trajectory. A step that itself earns 0 still inherits credit from later rewards through discounting.
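To see how zero-reward steps still get differentiated, consider the return `G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...` (a toy sketch with made-up rewards and an assumed `gamma = 0.99`):

```python
# Hypothetical per-step rewards: most steps earn 0, a brick breaks at t=3.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]
gamma = 0.99  # assumed discount factor

# Compute G_t for every t by sweeping backwards: g = r_t + gamma * g.
returns = []
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print(returns)  # [0.970299, 0.9801, 0.99, 1.0, 0.0]
# Every step leading up to the brick-break gets a distinct, nonzero return.
```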

Alex
  • Even if `done == True`, we still add the experience to the replay buffer, which may well be sampled for training the online network in the future. So how can the reward not matter? – Nilesh PS Mar 10 '18 at 16:44
  • @NileshPS You shouldn't be adding experience to a replay buffer after the environment is done. – Alex Mar 10 '18 at 16:45
  • [DQN algorithm](https://reading-club.github.io/assets/posts/Playing_Atari_with_Deep_Reinforcement_Learning/algorithm.png) If what you said is correct, then `y_j` should be set unconditionally, since `phi_{j+1}` is never a terminal state. – Nilesh PS Mar 10 '18 at 16:48
  • @NileshPS here's an implementation https://github.com/jalexvig/berkeley_deep_rl/blob/master/hw3/dqn.py – Alex Mar 10 '18 at 17:11
  • @NileshPS `phi_{j+1}` can be terminal. But that should be the last tuple of experience added to the buffer. That can still have a nonzero reward `r_{j}`. You should never add a tuple to the buffer where both `phi_j` and `phi_{j+1}` are terminal (see the target-computation sketch after these comments). – Dennis Soemers Mar 10 '18 at 17:12
  • @Alex I took a look and it saves experience to the buffer even if `done=True`. Lines 266 - 270 – Nilesh PS Mar 10 '18 at 17:14
  • @NileshPS that's only one timestep since the environment is reset. – Alex Mar 10 '18 at 17:17
  • @Alex Got it. Thanks :-) I am still not convinced about the second part of my question though. – Nilesh PS Mar 10 '18 at 17:19
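For reference, here is a sketch of the target computation the comments are debating (the `y_j` line from the linked DQN algorithm; the function name, array shapes, and values are assumptions for illustration, not from any particular implementation):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute y_j: r_j for terminal transitions, else r_j + gamma * max_a Q(phi_{j+1}, a).

    rewards:       shape (batch,)            -- r_j from the replay buffer
    next_q_values: shape (batch, n_actions)  -- target-network Q(phi_{j+1}, .)
    dones:         shape (batch,)            -- 1.0 where phi_{j+1} is terminal
    """
    bootstrap = next_q_values.max(axis=1)
    # (1 - dones) zeroes the bootstrap term for terminal transitions, which is
    # why the terminal tuple still belongs in the buffer: its reward r_j is
    # used even though phi_{j+1} is terminal.
    return rewards + gamma * (1.0 - dones) * bootstrap

# Example: the terminal transition (done=1.0) contributes only its reward.
y = dqn_targets(np.array([0.0, 1.0]),
                np.array([[0.5, 0.2], [0.3, 0.9]]),
                np.array([0.0, 1.0]))
print(y)  # [0.0 + 0.99*0.5, 1.0] = [0.495, 1.0]
```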