I'm new to deep reinforcement learning and the DQN model. I used OpenAI Gym to reproduce two experiments, CartPole-v0 and MountainCar-v0, and I based my code on these GitHub gists:
CartPole-v0: https://gist.github.com/floodsung/3b9d893f1e0788f8fad0e6b49cde70f1
MountainCar-v0: https://gist.github.com/floodsung/0c64d10cab5298c63cd0fc004a94ba1f
Both models run successfully and reach the expected reward in the test episodes, but the reward given at each time step differs between the two environments.
For CartPole-v0, the per-step reward is either +1 or 0. Each episode has up to 300 time steps, and the agent tries to collect as much total reward as it can. Source code: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
But in MountainCar-v0, the reward is always -1 for every action, so the agent tries to finish the episode quickly and end up with a less negative total reward. This is also explained here: "How does DQN work in an environment where reward is always -1".
So this leaves me confused about how to determine the reward for actions or states. It seems that both a positive and a negative per-step reward can make sense when the number of time steps is limited. What is the principle for choosing which one to use? I have also seen cases where the reward is a float value somewhere in between.
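(As an example of what I mean by a float reward in between: something like a hand-shaped reward for MountainCar based on the car's position, instead of the constant -1. This is just my own hypothetical illustration, not taken from the linked gists.)

```python
# Hypothetical shaped reward for MountainCar-v0 (my own example of a
# "float reward in between", not from the linked code).
import gym

env = gym.make("MountainCar-v0")
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)  # the env's own reward is always -1
    position, velocity = next_state
    # Hypothetical float reward: distance of the car from the valley bottom
    # (roughly position -0.5), so it grows as the car climbs toward the
    # goal at position 0.5.
    shaped_reward = abs(position - (-0.5))
    state = next_state
```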
And how do I avoid the "suicide" case, where the agent ends the episode early instead of trying to reach the target, because of the "living penalty" (the agent receives a penalty at each step, intended to favor exploitation over exploration)? See https://datascience.stackexchange.com/questions/43592/rl-weighthing-negative-rewards
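(To show what I mean by the "suicide" case, here is a toy calculation with made-up numbers, just to illustrate why a per-step penalty can make ending the episode early look better than reaching the target.)

```python
# Made-up numbers only, to illustrate the "suicide" problem with a living penalty.
living_penalty = -1     # reward received at every step
death_penalty = -10     # hypothetical extra penalty for terminating early
steps_to_goal = 50      # hypothetical number of steps needed to reach the target

return_reach_goal = living_penalty * steps_to_goal   # -50
return_suicide = living_penalty * 1 + death_penalty  # -11

# -11 > -50, so the agent prefers to end the episode immediately,
# unless the death penalty outweighs the worst-case cost of reaching the goal.
print(return_reach_goal, return_suicide)
```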
Thanks in advance!